Fast Speaker Identification Using a Universal Background Model Clustering Method

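  • 박주민 (Department of Electrical and Electronic Engineering, Korea Advanced Institute of Science and Technology) ;
  • 서영주 (Department of Electrical and Electronic Engineering, Korea Advanced Institute of Science and Technology) ;
  • 김회린 (Department of Electrical and Electronic Engineering, Korea Advanced Institute of Science and Technology)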
  • Received : 2014.01.21
  • Accepted : 2014.04.16
  • Published : 2014.05.31

Abstract

In this paper, we propose a new method that drastically reduces the computational complexity of Gaussian Mixture Model (GMM)-based Speaker Identification (SI). In general, the computational cost of a GMM-based SI system is proportional to three factors: the length of the test utterance, the number of enrolled speakers, and the GMM size. This cost makes SI systems difficult to deploy in many real applications despite their broad applicability, so the trade-off between computational complexity and identification accuracy is a primary issue in practice. To reduce computational complexity sharply with only a small loss of accuracy, we introduce a method based on Universal Background Model (UBM) clustering and show that it can be used successfully in real-time applications. In experiments, the proposed algorithm achieved a speed-up factor of 6 with a negligible loss of accuracy.
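The cost structure described in the abstract can be illustrated with a minimal sketch of baseline GMM scoring. This is not the proposed UBM clustering method, which the abstract does not detail. It only shows why the baseline cost grows with the number of frames, enrolled speakers, and mixture components. Diagonal covariances, the function names, and the toy models are all assumptions made for illustration.

```python
import numpy as np


def gmm_frame_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM (assumed form).

    frames: (T, D), weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, M)
    weighted = np.log(weights) + log_comp                              # (T, M)
    # Log-sum-exp over the M components for numerical stability.
    peak = weighted.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(weighted - peak).sum(axis=1))


def identify_speaker(frames, speaker_models):
    """Score every enrolled model on every frame and pick the best total."""
    scores = np.array(
        [gmm_frame_loglik(frames, *model).sum() for model in speaker_models]
    )
    return int(np.argmax(scores)), scores


def gaussian_evaluations(num_frames, num_speakers, num_components):
    """Component evaluations needed by the exhaustive baseline above."""
    return num_frames * num_speakers * num_components


# Hypothetical toy data: two enrolled speakers, two components each, 2-D features.
uniform = np.array([0.5, 0.5])
unit_var = np.ones((2, 2))
toy_models = [
    (uniform, np.array([[0.0, 0.0], [1.0, 1.0]]), unit_var),
    (uniform, np.array([[5.0, 5.0], [6.0, 6.0]]), unit_var),
]
toy_frames = np.array([[5.0, 5.0], [6.0, 6.0], [5.5, 5.5]])
best, toy_scores = identify_speaker(toy_frames, toy_models)
```

In this sketch every frame is scored against every component of every enrolled model, so the work is the product of the three factors named in the abstract. Any speed-up has to shrink at least one of them.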
