http://dx.doi.org/10.5370/KIEE.2014.63.6.813

Monosyllable Speech Recognition through Facial Movement Analysis  

Kang, Dong-Won (Dept. of Biomedical Engineering, Konkuk University)
Seo, Jeong-Woo (Dept. of Biomedical Engineering, Konkuk University)
Choi, Jin-Seung (Dept. of Biomedical Engineering, Konkuk University)
Choi, Jae-Bong (Department of Mechanical Systems Engineering, Hansung University)
Tack, Gye-Rae (Dept. of Biomedical Engineering, BK21+ Research Institute of Biomedical Engineering, Konkuk University)
Publication Information
The Transactions of The Korean Institute of Electrical Engineers / v.63, no.6, 2014, pp. 813-819
Abstract
The purpose of this study was to extract accurate facial-movement features using a 3-D motion capture system for lip-reading-based speech recognition. Instead of the features obtained from conventional camera images, the 3-D motion capture system was used to acquire quantitative data on actual facial movements, from which 11 variables showing characteristic patterns of nose, lip, jaw, and cheek movement during monosyllable vocalization were analyzed. Fourteen subjects, all in their 20s, vocalized 11 types of Korean vowel monosyllables three times each with 36 reflective markers attached to their faces. The acquired facial-movement data were converted into the 11 parameters and represented as a pattern for each monosyllable vocalization. These parameter patterns were then learned and recognized for each monosyllable using a speech recognition algorithm based on the Hidden Markov Model (HMM) and the Viterbi algorithm. The recognition accuracy for the 11 monosyllables was 97.2%, which suggests the feasibility of Korean speech recognition through quantitative facial-movement analysis.
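The recognition step described in the abstract (one HMM per monosyllable, scored with the Viterbi algorithm) can be sketched as follows. This is an illustrative assumption, not the authors' code: the discrete observation symbols stand in for quantized values of the 11 facial-movement parameters, and the toy models are hypothetical.

```python
import numpy as np

def viterbi_log(obs, log_pi, log_A, log_B):
    """Log-probability of the best state path for observation indices `obs`.
    log_pi: (S,) initial log-probs; log_A: (S,S) transitions; log_B: (S,V) emissions."""
    delta = log_pi + log_B[:, obs[0]]            # initialization
    for o in obs[1:]:                            # recursion: max over predecessor states
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return float(delta.max())                    # termination: best path score

def classify(obs, models):
    """Return the monosyllable whose HMM gives `obs` the highest Viterbi score."""
    return max(models, key=lambda m: viterbi_log(obs, *models[m]))

# Two hypothetical single-state HMMs over a binary observation alphabet,
# standing in for trained per-monosyllable models.
log = np.log
toy_models = {
    "a": (log([1.0]), log([[1.0]]), log([[0.9, 0.1]])),
    "i": (log([1.0]), log([[1.0]]), log([[0.1, 0.9]])),
}
```

With these toy models, an observation sequence dominated by symbol 0 is assigned to "a" and one dominated by symbol 1 to "i"; in the study's setting, the models would instead be trained on the 11-parameter patterns for each of the 11 monosyllables.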
Keywords
3-D motion capture system; Facial motion; Hidden Markov model; Speech recognition
Citations & Related Records
Times Cited by KSCI: 4
1 T. Chen, H. P. Graf, and K. Wang, "Speech-assisted video processing: Interpolation and low-bitrate coding," 28th Annual Asilomar Conference on Signals, Systems, and Computers (Asilomar '94), pp. 957-979, 1994.
2 G. Baily, E. Vatikiotis-Bateson, and P. Perrier, Visual and audio-visual speech processing, MIT Press, 2004.
3 A. Bagai, H. Gandhi, R. Goyal, M. Kohli, and T. V. Prasad, "Lip-Reading using Neural Networks," International Journal of Computer Science and Network Security, vol. 9, no. 4, pp. 108-111, 2009.
4 H. Mehrotra, G. Agrawal, and M. C. Srivastava, "Automatic Lip Contour Tracking and Visual Character Recognition for Computerized Lip Reading," International Journal of Computer Science, vol. 4, no. 1, pp. 62-71, 2009.
5 W. J. Ma, X. Zhou, L. A. Ross, J. J. Foxe, and L. C. Parra, "Lip reading aids word recognition most in moderate noise: a Bayesian explanation using high-dimensional feature space," PLoS ONE, vol. 4, no. 3, pp. 1-14, 2009.
6 J. J. Shin, J. Lee, and D. J. Kim, "Real-time lip reading system for isolated Korean word recognition," Pattern Recognition, vol. 44, pp. 559-571, 2011.
7 M. G. Song, T. P. Thanh, J. Y. Kim, and S. T. Hwang, "A Study on Lip Detection based on Eye Localization for Visual Speech Recognition in Mobile Environment," Journal of Korean Institute of Intelligent Systems, vol. 19, no. 4, pp. 478-484, 2009.
8 Y. T. Won, H. D. Kim, M. R. Lee, B. S. Jang, and H. S. Kwak, "A Character Speech Animation System for Language Education for Each Hearing Impaired Person," Journal of Digital Contents Society, vol. 9, no. 3, pp. 389-398, 2008.
9 K. H. Lee, J. J. Kum, and S. B. Rhee, "Design & Implementation of Lipreading System using the Articulatory Controls Analysis of the Korean 5 Vowels," The Journal of Korean Association of Computer Education, vol. 8, no. 4, pp. 281-288, 2007.
10 J. Ma, R. Cole, B. Pellom, W. Ward, and B. Wise, "Accurate Visible Speech Synthesis Based on Concatenating Variable Length Motion Capture Data," IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 2, pp. 266-276, 2006.
11 G. Bailly, F. Elisei, M. Odisio, D. Pele, D. Caillière, and K. Grein-Cochard, "Talking faces for MPEG-4 compliant scalable face-to-face telecommunication," Proceedings of the Smart Objects Conference (SOC '03), pp. 204-207, 2003.
12 P. Scanlon, and R. Reilly, "Feature analysis for automatic speech reading," Proc. of the IEEE Int. Conf. on Multimedia Signal Processing (MMSP '01), pp. 625-630, 2001.
13 R. Scott, "Sparking life: notes on the performance capture sessions for The Lord of the Rings: The Two Towers," ACM SIGGRAPH Computer Graphics, vol. 37, no. 4, pp. 17-21, 2003.
14 Y. Cao, P. Faloutsos, E. Kohler, and F. Pighin, "Real-time speech motion synthesis from recorded motions," In Proceedings of Eurographics/SIGGRAPH Symposium on Computer Animation (SCA '04), pp. 345-353, 2004.
15 S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Transactions on Multimedia, vol. 2, pp. 141-151, 2000.
16 C. G. Lee, I. M. So, Y. U. Kim, J. R. Kim, S. K. Kang, and S. T. Jung, "Implementation of three dimension lip reading system using stereo vision," Proceedings of Korea multimedia society conference (KMMS '04), pp. 489-492, 2004.
17 H. S. Koh, S. M. Han, J. U. Chu, S. H. Park, J. B. Choi, G. W. Choi, D. S. Hwang, and I. C. Youn, "The three-dimensional lip shape tracking system using stereo camera," Proceedings of the Korean Society of Precision Engineering (KSPE '11) Conference, pp. 979-980, 2011.
18 K. H. Lee, R. Yong, and S. O. Kim, "A study on speechreading the Korean 8 vowels," Journal of the Korea Society of Computer and Information, vol. 14, no. 3, pp. 173-182, 2009.
19 J. Y. Kim, S. H. Min, and S. H. Choi, "Robustness of Bimodal Speech Recognition on Degradation of Lip Parameter Estimation Performance," Journal of the Korean Society of Phonetic Science and Speech Technology, vol. 10, no. 2, pp. 29-33, 2003.
20 K. H. Nam and C. S. Bae, "A study on the lip shape recognition algorithm using 3-D Model," The Journal of the Korean Institute of Maritime Information & Communication Sciences, vol. 6, no. 5, pp. 783-788, 2002.
21 G. Galatas, G. Potamianos, D. Kosmopoulos, C. McMurrough, and F. Makedon, "Bilingual Corpus for AVASR using Multiple Sensors and Depth Information," Auditory-Visual Speech Processing (AVSP '11), pp. 103-106, 2011.
22 I. S. Pandzic, and R. Forchheimer, MPEG-4 Facial Animation: The Standard, Implementation, and Applications, John Wiley and Sons, Inc., New York, 2002.
23 A. Srinivasan, "Speech Recognition Using Hidden Markov Model," Applied Mathematical Sciences, vol. 5, no. 79, pp. 3943-3948, 2011.
24 D. A. Pierre, Optimization Theory with Applications, Dover Publications, Inc., New York, 1986.
25 N. Eveno, A. Caplier, and P. Y. Coulon, "Accurate and quasi-automatic lip tracking," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 706-715, 2004.
26 X. D. Huang, Y. Ariki, and M.A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh Univ. Press, Edinburgh, 1990.