http://dx.doi.org/10.9708/jksci.2021.26.10.037

Performance Comparison of Korean Dialect Classification Models Based on Acoustic Features  

Kim, Young Kook (Dept. of Software, Soongsil University)
Kim, Myung Ho (Dept. of Software, Soongsil University)
Abstract
The acoustic features of speech carry important social and linguistic information about the speaker, and one of the key features is the speaker's dialect. A speaker's use of a dialect is a major barrier to interaction with a computer. Dialects can be distinguished at various levels, such as phonemes, syllables, words, phrases, and sentences, but it is difficult to distinguish a dialect by identifying each of these units individually. Therefore, in this paper we propose a lightweight Korean dialect classification model that uses only MFCC features extracted from speech data. We study the optimal way to utilize MFCC features on Korean conversational voice data and compare classification performance for five Korean dialects (Gyeonggi/Seoul, Gangwon, Chungcheong, Jeolla, and Gyeongsang) across eight machine learning and deep learning classification models. Normalizing the MFCC improved the performance of most classification models: accuracy improved by 1.07% and F1-score by 2.04% over the best-performing classification model trained on unnormalized MFCC features.
Keywords
Machine Learning; Deep Learning; MFCC; Dialect Classification; Speech Analysis
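The MFCC normalization step highlighted in the abstract can be sketched as follows. This is a hedged illustration only, since the abstract does not specify the exact scheme: it assumes per-coefficient min-max scaling with NumPy, and the `normalize_mfcc` helper and the toy random matrix are invented here for demonstration.

```python
import numpy as np

def normalize_mfcc(mfcc: np.ndarray) -> np.ndarray:
    """Scale each MFCC coefficient (one row per coefficient) to [0, 1].

    A small epsilon in the denominator guards against division by zero
    for coefficients that are constant across all frames.
    """
    mins = mfcc.min(axis=1, keepdims=True)
    maxs = mfcc.max(axis=1, keepdims=True)
    return (mfcc - mins) / (maxs - mins + 1e-8)

# Toy stand-in for real features: 13 coefficients x 100 frames.
rng = np.random.default_rng(0)
mfcc = rng.normal(loc=-20.0, scale=50.0, size=(13, 100))
norm = normalize_mfcc(mfcc)
print(norm.shape, float(norm.min()), float(norm.max()))
```

Bringing every coefficient into a common range in this way is what lets distance- and gradient-based classifiers treat all MFCC dimensions on an equal footing, which is consistent with the performance gains the paper reports after normalization.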