Acknowledgement
This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea in 2018 (NRF-2018S1A5A2A03037308), and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (Ministry of Science and ICT) in 2022 [RS-2022-00155915, Artificial Intelligence Convergence Innovation Human Resources Development (Inha University)].
References
- Bhattacharjee, U., & Sarmah, K. (2013, March). Language identification system using MFCC and prosodic features. Proceedings of the 2013 International Conference on Intelligent Systems and Signal Processing (ISSP) (pp. 194-197). Vallabh Vidyanagar, India.
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324
- Chittaragi, N. B., & Koolagudi, S. G. (2019). Acoustic-phonetic feature based Kannada dialect identification from vowel sounds. International Journal of Speech Technology, 22(4), 1099-1113. https://doi.org/10.1007/s10772-019-09646-1
- Chowdhury, S. A., Ali, A., Shon, S., & Glass, J. (2020, October). What does an end-to-end dialect identification model learn about non-dialectal information? Proceedings of the INTERSPEECH 2020 (pp. 462-466). Shanghai, China.
- Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357-366. https://doi.org/10.1109/TASSP.1980.1163420
- de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917-1930. https://doi.org/10.1121/1.1458024
- Dheram, P., Ramakrishnan, M., Raju, A., Chen, I. F., King, B., Powell, K., & Stolcke, A. (2022, September). Toward fairness in speech recognition: Discovery and mitigation of performance disparities. Proceedings of the INTERSPEECH 2022 (pp. 1268-1272). Incheon, Korea.
- Fenu, G., Medda, G., Marras, M., & Meloni, G. (2020, November). Improving fairness in speaker recognition. Proceedings of the 2020 European Symposium on Software Engineering (pp. 129-136). Rome, Italy.
- Garcia-Romero, D., Snyder, D., Watanabe, S., Sell, G., McCree, A., Povey, D., & Khudanpur, S. (2019, September). Speaker recognition benchmark using the CHiME-5 corpus. Proceedings of the INTERSPEECH 2019 (pp. 1506-1510). Graz, Austria.
- Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Schölkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and Their Applications, 13(4), 18-28. https://doi.org/10.1109/5254.708428
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. Y. (2017, December). LightGBM: A highly efficient gradient boosting decision tree. Proceedings of the 31st Conference on Neural Information Processing Systems. Long Beach, CA.
- Keesing, A., Koh, Y. S., & Witbrock, M. (2021, August). Acoustic features and neural representations for categorical emotion recognition from speech. Proceedings of the INTERSPEECH 2021 (pp. 3415-3419). Brno, Czechia.
- Khurana, S., Najafian, M., Ali, A., Hanai, T. A., Belinkov, Y., & Glass, J. (2017, August). QMDIS: QCRI-MIT advanced dialect identification system. Proceedings of the INTERSPEECH 2017 (pp. 2591-2595). Stockholm, Sweden.
- Kim, Y. K., & Kim, M. H. (2021). Performance comparison of Korean dialect classification models based on acoustic features. Journal of the Korea Society of Computer and Information, 26(10), 37-43. https://doi.org/10.9708/JKSCI.2021.26.10.037
- Lee, J., Kim, K., & Chung, M. (2021, November). Korean dialect identification based on intonation modeling. Proceedings of the 2021 24th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA) (pp. 168-173). Singapore, Singapore.
- Lee, J., Kim, K., & Chung, M. (2022, November). Korean dialect identification based on an ensemble of prosodic and segmental feature learning for forensic speaker profiling. Proceedings of the 2022 25th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA) (pp. 1-6). Hanoi, Vietnam.
- Likitha, M. S., Gupta, S. R. R., Hasitha, K., & Upendra Raju, A. (2017, March). Speech based human emotion recognition using MFCC. Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET) (pp. 2257-2260). Chennai, India.
- Lin, W., & Mak, M. W. (2020, October). Wav2Spk: A simple DNN architecture for learning speaker embeddings from waveforms. Proceedings of the INTERSPEECH 2020 (pp. 3211-3215). Shanghai, China.
- Mehrabani, M., & Hansen, J. H. L. (2015). Automatic analysis of dialect/language sets. International Journal of Speech Technology, 18(3), 277-286. https://doi.org/10.1007/s10772-014-9268-y
- Michon, E., Pham, M. Q., Crego, J., & Senellart, J. (2018, August). Neural network architectures for Arabic dialect identification. Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 128-136). Santa Fe, NM.
- Mukherjee, H., Obaidullah, S. M., Santosh, K. C., Phadikar, S., & Roy, K. (2020). A lazy learning-based language identification from speech using MFCC-2 features. International Journal of Machine Learning and Cybernetics, 11(1), 1-14. https://doi.org/10.1007/s13042-019-00928-3
- Najafian, M., Khurana, S., Shon, S., Ali, A., & Glass, J. (2018, April). Exploiting convolutional neural networks for phonotactic based dialect identification. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5174-5178). Calgary, AB.
- Pappagari, R., Cho, J., Moro-Velazquez, L., & Dehak, N. (2020, October). Using state of the art speaker recognition and natural language processing technologies to detect Alzheimer's disease and assess its severity. Proceedings of the INTERSPEECH 2020 (pp. 2177-2181). Shanghai, China.
- Sarma, M., Ghahremani, P., Povey, D., Goel, N. K., Sarma, K. K., & Dehak, N. (2018, September). Emotion identification from raw speech signals using DNNs. Proceedings of the INTERSPEECH 2018 (pp. 3097-3101). Hyderabad, India.
- Saste, S. T., & Jagdale, S. M. (2017, April). Emotion recognition from speech using MFCC and DWT for security system. Proceedings of the 2017 International Conference of Electronics, Communication and Aerospace Technology (ICECA) (pp. 701-704). Coimbatore, India.
- Seo, J., & Lee, B. (2022). Multi-task conformer with multi-feature combination for speech emotion recognition. Symmetry, 14(7), 1428.
- Shahnawazuddin, S., Dey, A., & Sinha, R. (2016, September). Pitch-adaptive front-end features for robust children's ASR. Proceedings of the INTERSPEECH 2016 (pp. 3459-3463). San Francisco, CA.
- Sohn, J., Kim, N. S., & Sung, W. (1999). A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6(1), 1-3. https://doi.org/10.1109/97.736233
- Tan, Z. H., Sarkar, A. K., & Dehak, N. (2020). rVAD: An unsupervised segment-based robust voice activity detection method. Computer Speech & Language, 59, 1-21. https://doi.org/10.1016/j.csl.2019.06.005
- Tawaqal, B., & Suyanto, S. (2021). Recognizing five major dialects in Indonesia based on MFCC and DRNN. Journal of Physics: Conference Series, 1844, 012003.
- Tüske, Z., Golik, P., Nolden, D., Schlüter, R., & Ney, H. (2014, September). Data augmentation, feature combination, and multilingual neural networks to improve ASR and KWS performance for low-resource languages. Proceedings of the INTERSPEECH 2014 (pp. 1420-1424). Singapore, Singapore.
- Wallington, E., Kershenbaum, B., Klejch, O., & Bell, P. (2021, August-September). On the learning dynamics of semi-supervised training for ASR. Proceedings of the INTERSPEECH 2021 (pp. 716-720). Brno, Czechia.
- Wan, M., Ren, J., Ma, M., Li, Z., Cao, R., & Gao, Q. (2022, March). Deep neural network based Chinese dialect classification. Proceedings of the 2021 Ninth International Conference on Advanced Cloud and Big Data (CBD) (pp. 207-212). Xi'an, China.
- Wang, D., Ye, S., Hu, X., Li, S., & Xu, X. (2021, August). An end-to-end dialect identification system with transfer learning from a multilingual automatic speech recognition model. Proceedings of the INTERSPEECH 2021 (pp. 3266-3270). Brno, Czechia.
- Ying, W., Zhang, L., & Deng, H. (2020). Sichuan dialect speech recognition with deep LSTM network. Frontiers of Computer Science, 14(2), 378-387. https://doi.org/10.1007/s11704-018-8030-z
- Zhang, Q., & Hansen, J. H. L. (2018). Language/dialect recognition based on unsupervised deep learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(5), 873-882. https://doi.org/10.1109/TASLP.2018.2797420