Dialect classification based on the speed and the pause of speech utterances

  • Jonghwan Na (Department of Electrical and Computer Engineering, Inha University)
  • Bowon Lee (Department of Electrical and Computer Engineering, Inha University)
  • Received : 2023.05.25
  • Accepted : 2023.06.14
  • Published : 2023.06.30

Abstract

In this paper, we propose an approach to dialect classification based on the speech rate and pause durations of utterances, together with the age and gender of the speakers. Dialect classification is an important technique for speech analysis; for example, an accurate dialect classification model can potentially improve the performance of speaker or speech recognition. Previous studies have largely relied on deep learning with Mel-Frequency Cepstral Coefficient (MFCC) features. We instead focus on acoustic differences between regions and classify dialects using features extracted from those differences: the underexplored speech rate and pause durations of utterances, combined with speaker metadata such as age and gender. Experimental results show that the proposed approach achieves higher accuracy than a method using only MFCC features, with the speech rate feature contributing the largest improvement. Incorporating all of the proposed features raises the accuracy from 91.02% to 97.02% relative to the MFCC-only baseline.
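The paper's implementation is not reproduced here, but the pipeline the abstract describes can be sketched in a few lines. The sketch below is an illustration only, under assumptions the abstract does not state: librosa's energy-based `librosa.effects.split` stands in for the paper's pause detection, speech rate is computed from a transcript-based syllable count supplied by the caller, and scikit-learn's RandomForestClassifier stands in for whichever classifier the authors actually used.

```python
# Minimal sketch (not the authors' released code) of the feature pipeline the
# abstract describes: speech-rate and pause features plus MFCC statistics and
# speaker metadata, fed to a conventional classifier.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def utterance_features(wav_path, n_syllables, top_db=30, n_mfcc=13):
    """Return [speech_rate, total_pause, n_pauses, mean MFCCs] for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    total_dur = len(y) / sr

    # Energy-based segmentation: (start, end) sample indices of non-silent spans.
    # This is a stand-in for the paper's pause-detection method.
    voiced = librosa.effects.split(y, top_db=top_db)
    voiced_dur = sum((e - s) for s, e in voiced) / sr
    pause_dur = total_dur - voiced_dur          # total pause length in seconds
    n_pauses = max(len(voiced) - 1, 0)          # gaps between voiced spans

    # Speech rate: syllables per second of voiced speech (the paper's exact
    # definition may differ); syllable count assumed to come from a transcript.
    speech_rate = n_syllables / voiced_dur if voiced_dur > 0 else 0.0

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([[speech_rate, pause_dur, n_pauses], mfcc.mean(axis=1)])

def build_matrix(items):
    """items: dicts with 'wav', 'n_syllables', 'age', 'gender', 'dialect' keys (hypothetical schema)."""
    X, y = [], []
    for it in items:
        acoustic = utterance_features(it["wav"], it["n_syllables"])
        meta = [it["age"], 1.0 if it["gender"] == "F" else 0.0]
        X.append(np.concatenate([acoustic, meta]))
        y.append(it["dialect"])
    return np.array(X), np.array(y)

# Usage sketch:
# X_train, y_train = build_matrix(train_items)
# clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
```

Treating the age and gender metadata as two extra columns keeps the classifier agnostic to how the acoustic features were obtained; the MFCC-only baseline mentioned in the abstract would correspond to dropping the speech-rate, pause, and metadata columns from the feature matrix.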

Keywords

Acknowledgement

This work was supported in 2018 by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2018S1A5A2A03037308), and in 2022 by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (Ministry of Science and ICT) [RS-2022-00155915, Artificial Intelligence Convergence Innovation Human Resources Development (Inha University)].
