Region of Interest Extraction and Bilinear Interpolation Application for Preprocessing of Lipreading Systems

Jae Hyeok Han;Yong Ki Kim;Mi Hye Kim;

doi:10.3745/TKIPS.2024.13.4.189

The Transactions of the Korea Information Processing Society (정보처리학회 논문지)

Volume 13, Issue 4, Pages 189-198, 2024, eISSN 3022-7011

Korea Information Processing Society (한국정보처리학회)

입 모양 인식 시스템 전처리를 위한 관심 영역 추출과 이중 선형 보간법 적용

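Jae Hyeok Han (Department of Computer Engineering, Chungbuk National University);
Yong Ki Kim (Chungbuk National University);
Mi Hye Kim (Department of Computer Engineering, Chungbuk National University)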

Received : 2024.02.05
Accepted : 2024.03.08
Published : 2024.04.30

https://doi.org/10.3745/TKIPS.2024.13.4.189

Abstract

Lipreading is an important part of speech recognition, and several studies have sought to improve the performance of lipreading systems used for speech recognition. Recent studies have improved recognition performance by modifying the model architecture of the lipreading system. Unlike that work, we aim to improve recognition performance without any change to the model architecture. To do so, we refer to the cues that humans use when lipreading and define regions of interest that add other regions, such as the chin and cheeks, to the lip region, which is the conventional region of interest in lipreading systems. We compare the recognition rate obtained with each region of interest and propose the best-performing one. In addition, we assume that the interpolation method used when normalizing the size of the region of interest changes the normalization result and therefore affects recognition performance. We interpolate the same region of interest using nearest neighbor, bilinear, and bicubic interpolation, compare the recognition rate of each method, and propose the best-performing one. Each region of interest was detected by a trained object detection neural network. Dynamic time warping templates were generated by normalizing each region of interest, extracting and combining features, reducing the dimensionality of the combined features, and mapping the result into a low-dimensional space. The recognition rate was evaluated by comparing the distance between the generated dynamic time warping templates and the data mapped into the low-dimensional space.
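The template comparison described above can be illustrated with a minimal sketch of dynamic time warping (DTW). This is not the authors' implementation (see their repository for that). It assumes each utterance is a sequence of low-dimensional feature vectors, uses Euclidean distance as the local cost, and the names `dtw_distance`, `classify`, and the toy labels are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of equal-dimension vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean local cost (assumption)
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b repeats a frame
                cost[i][j - 1],      # b advances while a repeats a frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def classify(sample, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


# Toy 2-D feature sequences standing in for mapped utterances (hypothetical data).
templates = {
    "word_a": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    "word_b": [(2.0, 0.0), (1.0, 0.0), (0.0, 0.0)],
}
# Same trajectory as "word_a", but spoken more slowly at the start.
sample = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
predicted = classify(sample, templates)
```

Because DTW allows frames to repeat, the slower `sample` still aligns with `word_a` at zero cost, which is why this kind of distance suits utterances of varying duration.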
In the comparison of regions of interest, the region of interest containing only the lip region achieved an average recognition rate of 97.36%, which is 3.44 percentage points higher than the 93.92% average recognition rate reported in the previous study. In the comparison of interpolation methods, bilinear interpolation achieved 97.36%, which is 14.65 percentage points higher than nearest neighbor interpolation and 5.55 percentage points higher than bicubic interpolation. The code used in this study can be found at https://github.com/haraisi2/Lipreading-Systems.
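To make the difference between the compared interpolation methods concrete, the following sketch resizes a tiny grayscale patch with nearest neighbor and bilinear interpolation. It is an illustration under stated assumptions, not the code from the study: the image is a list of rows of floats, pixel centres are aligned between source and target, and the function names are hypothetical. Bicubic interpolation is omitted for brevity.

```python
def _src_coord(dst, scale, size):
    """Map a target pixel index to a source coordinate (pixel-centre aligned, clamped)."""
    return min(max((dst + 0.5) * scale - 0.5, 0.0), size - 1.0)


def resize_nearest(img, out_h, out_w):
    """Each output pixel copies the single closest source pixel."""
    in_h, in_w = len(img), len(img[0])
    return [
        [
            img[min(int((y + 0.5) * in_h / out_h), in_h - 1)]
               [min(int((x + 0.5) * in_w / out_w), in_w - 1)]
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def resize_bilinear(img, out_h, out_w):
    """Each output pixel is a distance-weighted mean of its 2x2 source neighbours."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        fy = _src_coord(y, in_h / out_h, in_h)
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0  # vertical weight of the lower neighbour
        row = []
        for x in range(out_w):
            fx = _src_coord(x, in_w / out_w, in_w)
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0  # horizontal weight of the right neighbour
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# A 1x2 patch stretched to 1x4: nearest repeats values, bilinear blends them.
patch = [[0.0, 10.0]]
nearest = resize_nearest(patch, 1, 4)    # [[0.0, 0.0, 10.0, 10.0]]
bilinear = resize_bilinear(patch, 1, 4)  # [[0.0, 2.5, 7.5, 10.0]]
```

Nearest neighbor interpolation produces blocky edges because values are copied, while bilinear interpolation produces smooth transitions; differences of this kind in the normalized region of interest are what the study evaluates against recognition rate.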


Acknowledgement

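This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP), funded by the Korean government (Ministry of Science and ICT), as part of the Regional Intelligence Innovation Talent Development Program (IITP-2024-2020-0-01462).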
