Speaker verification with ECAPA-TDNN trained on a new dataset combining Voxceleb and Korean

  • Keumjae Yoon (Department of Statistics, Pusan National University)
  • Soyoung Park (Department of Statistics, Pusan National University)
  • Submitted: 2023.07.19
  • Reviewed: 2023.10.25
  • Published: 2024.04.30

Abstract

Speaker verification is becoming popular as a method of non-face-to-face identity authentication. It is the task of determining whether two voice recordings belong to the same speaker. In cases where the criminal's voice is the only evidence left at a crime scene, a speaker verification system that can compare two pieces of voice evidence objectively and accurately is essential. In this study, a new speaker verification system for the Korean language was built using a deep learning model, and the appropriate form of training dataset was investigated. Because voice data are high-dimensional and highly variable, containing background noise among other distortions, deep learning-based methods are a natural choice for speaker matching. To construct the matching algorithm, the ECAPA-TDNN model, one of the best-known deep learning architectures for speaker verification, was selected. Voxceleb, the large-scale voice dataset used to train the model, contains recordings from speakers of many nationalities but no Korean. To study the appropriate form of training data for the Korean language, experiments were carried out to determine how Korean voice data affect matching performance. Comparing a model trained only on Voxceleb with a model trained on a combined Voxceleb-and-Korean dataset, constructed to maximize language and speaker diversity, the model whose training data included Korean performed better on all test sets.
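At test time, verification reduces to comparing two fixed-dimensional speaker embeddings produced by the trained network. The snippet below is a minimal sketch of that decision rule, not the authors' code: it uses the publicly available SpeechBrain checkpoint `speechbrain/spkrec-ecapa-voxceleb` (trained on Voxceleb only) as a stand-in for the model trained in this paper, and the audio file names are placeholders. Each recording is embedded with ECAPA-TDNN, the pair is scored by cosine similarity, and the trial is accepted as "same speaker" when the score clears a tuned threshold. In the paper's setting the same scoring rule would apply, only with the encoder retrained on the combined Voxceleb-plus-Korean data.

```python
# A minimal sketch of the verification step, assuming the public
# SpeechBrain ECAPA-TDNN checkpoint as a stand-in for the paper's model.
# "enrol.wav" and "test.wav" are placeholder file names.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Each utterance is mapped to a fixed-dimensional speaker embedding;
# the trial score is the cosine similarity between the two embeddings,
# and "same speaker" is declared when the score exceeds a threshold
# tuned on a development set.
score, same_speaker = verifier.verify_files("enrol.wav", "test.wav")
print(f"score = {score.item():.3f}, same speaker: {bool(same_speaker)}")
```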

Keywords

Funding

This work was supported by a 2-Year Research Grant of Pusan National University.

References

  1. Allen JB and Rabiner LR (1977). A unified approach to short-time Fourier analysis and synthesis, Proceedings of the IEEE, 65, 1558-1564. https://doi.org/10.1109/PROC.1977.10770
  2. Cochran WT, Cooley JW, Favin DL et al. (1967). What is the fast Fourier transform?, Proceedings of the IEEE, 55, 1664-1674. https://doi.org/10.1109/PROC.1967.5957
  3. Dehak N, Kenny PJ, Dehak R, Dumouchel P, and Ouellet P (2010). Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, 19, 788-798. https://doi.org/10.1109/TASL.2010.2064307
  4. Deng J, Guo J, Xue N, and Zafeiriou S (2019). ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 4690-4699.
  5. Desplanques B, Thienpondt J, and Demuynck K (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification, In Proc. Interspeech, 2020, Available from: arXiv preprint arXiv:2005.07143
  6. Gao SH, Cheng MM, Zhao K, Zhang XY, Yang MH, and Torr P (2019). Res2net: A new multi-scale backbone architecture, IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 652-662. https://doi.org/10.1109/TPAMI.2019.2938758
  7. Garcia-Romero D and Espy-Wilson CY (2011). Analysis of i-vector length normalization in speaker recognition systems, Twelfth Annual Conference of the International Speech Communication Association, 2011, 249-252.
  8. He K, Zhang X, Ren S, and Sun J (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 770-778.
  9. Hu J, Shen L, and Sun G (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 7132-7141.
  10. Kingma DP and Ba J (2014). Adam: A method for stochastic optimization, In International Conference on Learning Representations (ICLR), 2015, Available from: arXiv preprint arXiv:1412.6980
  11. Krizhevsky A (2009). Learning multiple layers of features from tiny images (Technical report), University of Toronto, Toronto.
  12. Nagrani A, Chung JS, Xie W, and Zisserman A (2020). Voxceleb: Large-scale speaker verification in the wild, Computer Speech & Language, 60, 101027.
  13. Okabe K, Koshinaka T, and Shinoda K (2018). Attentive statistics pooling for deep speaker embedding, In Proc. Interspeech, 2018, 2252-2256, Available from: arXiv preprint arXiv:1803.10963
  14. Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, and Le QV (2019). SpecAugment: A simple data augmentation method for automatic speech recognition, In Proc. Interspeech, 2019, 2613-2617. https://doi.org/10.21437/Interspeech.2019-2680
  15. Peddinti V, Povey D, and Khudanpur S (2015). A time delay neural network architecture for efficient modeling of long temporal contexts, Sixteenth Annual Conference of the International Speech Communication Association.
  16. Reynolds DA, Quatieri TF, and Dunn RB (2000). Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, 10, 19-41. https://doi.org/10.1006/dspr.1999.0361
  17. Viikki O and Laurila K (1998). Cepstral domain segmental feature vector normalization for noise robust speech recognition, Speech Communication, 25, 133-147. https://doi.org/10.1016/S0167-6393(98)00033-8