인공 지능 기술을 이용한 음성 인식 기술에 대한 고찰

A Study on Speech Recognition Technology Using Artificial Intelligence Technology

  • Young Jo Lee (Department of Software Engineering, Hyupsung University) ;
  • Ki Seung Lee (School of Electrical and Electronic Engineering, Konkuk University) ;
  • Sung Jin Kang (School of Electrical, Electronics & Communication Engineering, Korea University of Technology and Education)
  • Received : 2024.09.10
  • Reviewed : 2024.09.14
  • Published : 2024.09.30

Abstract

This paper explores recent advancements in speech recognition technology, focusing on the integration of artificial intelligence to improve recognition accuracy in challenging environments, such as noisy or low-quality audio conditions. Traditional speech recognition methods often suffer from performance degradation in noisy settings, but the application of deep neural networks (DNNs) has led to significant improvements, enabling more robust and reliable recognition in industries including banking, automotive, healthcare, and manufacturing. A key area of advancement is the use of Silent Speech Interfaces (SSIs), which allow communication through non-acoustic signals, such as visual cues or auxiliary signals like ultrasound and electromyography, making them particularly useful for individuals with speech impairments. The paper further discusses the development of multi-modal speech recognition, which combines audio and visual inputs to enhance recognition accuracy in noisy environments. Recent research into lip-reading technology and the use of deep learning architectures such as CNNs and RNNs has significantly improved speech recognition by extracting meaningful features from video signals, even under difficult lighting conditions. Additionally, the paper covers self-supervised learning techniques, such as AV-HuBERT, which leverage large-scale unlabeled audiovisual datasets to improve performance. The future of speech recognition technology is likely to see further integration of AI-driven methods, making it applicable across diverse industries and for individuals with communication challenges. The conclusion emphasizes the need for further research, especially for languages with complex morphological structures, such as Korean.
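The multi-modal direction summarized above can be illustrated with a toy late-fusion rule: each modality produces class log-probabilities, and an audio-reliability weight derived from the estimated SNR decides how much the acoustic stream is trusted. This is a minimal sketch only; the function name `fuse_modalities`, the logistic weighting, and the 5 dB scale are illustrative assumptions, not a method taken from the paper.

```python
import numpy as np

def fuse_modalities(audio_logp, visual_logp, snr_db):
    """Toy SNR-weighted late fusion for audio-visual recognition.

    audio_logp / visual_logp: per-class log-probabilities from each stream.
    snr_db: estimated acoustic signal-to-noise ratio in dB (assumed given).
    Returns the index of the winning class.
    """
    # Logistic reliability in (0, 1): high SNR -> alpha near 1 (trust audio),
    # low SNR -> alpha near 0 (fall back on the visual stream).
    alpha = 1.0 / (1.0 + np.exp(-snr_db / 5.0))
    fused = alpha * np.asarray(audio_logp) + (1.0 - alpha) * np.asarray(visual_logp)
    return int(np.argmax(fused))
```

In clean conditions (e.g. +20 dB) the fused decision follows the audio stream; in heavy noise (e.g. -20 dB) it follows the lip-reading stream, which is the behavior that motivates audio-visual systems in the first place.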
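The remark about extracting usable features from video even under difficult lighting can likewise be made concrete with one common normalization step: histogram equalization, which stretches a low-contrast lip-region frame over the full intensity range before it reaches a lip-reading network. The sketch below is a generic illustration under stated assumptions (8-bit grayscale input), not the specific preprocessing used in the surveyed work.

```python
import numpy as np

def equalize_histogram(frame):
    """Histogram-equalize an 8-bit grayscale frame (H x W uint8 array).

    Low-light or low-contrast frames occupy a narrow intensity band;
    mapping each pixel through the normalized cumulative histogram
    spreads that band across the full 0-255 range.
    """
    hist = np.bincount(frame.ravel(), minlength=256)  # intensity counts
    cdf = hist.cumsum()
    cdf = cdf / cdf[-1]                               # normalize to [0, 1]
    lut = np.round(255 * cdf).astype(np.uint8)        # lookup table
    return lut[frame]                                 # remap every pixel
```

Applied to a frame whose pixels all sit within a few gray levels of each other, the output spans a much wider range, which tends to make edge and texture features around the lips easier for a downstream CNN to pick up.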
