Sequential Speaker Classification Using Quantized Generic Speaker Models


  • Kwon, Soon-Il (Division of Systems Technology, Korea Institute of Science and Technology)
  • Published: 2007.01.25

Abstract

In sequential speaker classification, the lack of prior information about the speakers makes model initialization difficult. To address this, a predetermined set of generic models, called Sample Speaker Models, was previously proposed; it allows speaker models to be initialized and speakers to be classified without any speaker data collected in advance. However, a systematic way to sample these models from a pool of generic models is still required. To solve this problem, the Speaker Quantization method, motivated by vector quantization, is proposed. In experiments on Switchboard (landline) telephone conversations, sample speaker models selected by Speaker Quantization reduced the error rate by 25% relative to random sampling.

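To make the contrast between random sampling and a vector-quantization-style selection concrete, here is a minimal sketch. It is not the author's implementation, and several choices are assumptions not taken from the paper: each generic speaker model is reduced to a single diagonal-covariance Gaussian, the distance between models is the symmetric Kullback-Leibler divergence, and the "codebook" is designed with a k-medoids (Lloyd-style) loop whose codewords must be members of the pool. The names `kl_diag`, `sym_kl`, `distortion`, `random_sampling` and `speaker_quantization` are hypothetical.

```python
import math
import random

# ASSUMPTION: a generic speaker model is simplified to one diagonal-covariance
# Gaussian (means, variances); practical systems typically use GMMs.
Model = tuple[list[float], list[float]]


def kl_diag(p: Model, q: Model) -> float:
    """Closed-form KL(p || q) between two diagonal-covariance Gaussians."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(
        math.log(bv / av) + (av + (am - bm) ** 2) / bv - 1.0
        for am, av, bm, bv in zip(mp, vp, mq, vq)
    )


def sym_kl(p: Model, q: Model) -> float:
    """Symmetric KL divergence, used here (an assumption) as the model distance."""
    return kl_diag(p, q) + kl_diag(q, p)


def distortion(pool: list[Model], codebook: list[int]) -> float:
    """Average distance from each pool model to its nearest codebook model."""
    return sum(min(sym_kl(m, pool[c]) for c in codebook) for m in pool) / len(pool)


def random_sampling(pool: list[Model], k: int, seed: int = 0) -> list[int]:
    """Baseline: indices of k pool models drawn uniformly without replacement."""
    return random.Random(seed).sample(range(len(pool)), k)


def speaker_quantization(pool: list[Model], k: int, seed: int = 0,
                         max_iter: int = 50) -> list[int]:
    """Hypothetical VQ-style (k-medoids) selection of k sample speaker models.

    Codewords are restricted to pool members, so every selected model is a
    real generic model. The loop starts from the random-sampling baseline and
    neither step can raise the average distortion.
    """
    medoids = random_sampling(pool, k, seed)
    for _ in range(max_iter):
        # Nearest-neighbour step; a codeword keeps itself on ties, so every
        # cluster is non-empty and the codewords stay distinct.
        clusters = {c: [] for c in medoids}
        for i, m in enumerate(pool):
            nearest = min(medoids, key=lambda c: (sym_kl(m, pool[c]), c != i))
            clusters[nearest].append(i)
        # Codeword update: the member with the smallest total distance to the
        # rest of its cluster (the current codeword wins ties).
        new_medoids = [
            min(members, key=lambda j: (
                sum(sym_kl(pool[t], pool[j]) for t in members), j != c, j))
            for c, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


if __name__ == "__main__":
    # Hypothetical pool: three groups of four 2-D models with unit variances.
    pool = [([float(cx + j), float(cy)], [1.0, 1.0])
            for cx, cy in [(0, 0), (10, 10), (20, 0)] for j in range(4)]
    quantized = speaker_quantization(pool, k=3)
    sampled = random_sampling(pool, k=3)
    print("quantized:", quantized, distortion(pool, quantized))
    print("random:   ", sampled, distortion(pool, sampled))
```

Starting the loop from the same random draw as the baseline makes the comparison direct: on the average-distance criterion, the quantized selection can only match or improve on that draw. The sketch says nothing about classification error; the 25% relative error reduction stated in the abstract comes from the paper's experiments, not from this code.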

References

  1. J. P. Campbell, 'Speaker recognition: A tutorial,' Proc. of the IEEE, Vol. 85, pp. 1436-1462, 1997 https://doi.org/10.1109/JPROC.1997.628713
  2. T. M. Cover and J. A. Thomas, 'Elements of Information Theory,' Wiley Interscience, New York, pp. 18-19, 1991
  3. M. Do, 'Fast Approximation of Kullback-Leibler Distance for Dependence Trees and Hidden Markov Models,' IEEE Signal Processing Letters, Vol. 10, pp. 115-118, 2003 https://doi.org/10.1109/LSP.2003.809034
  4. R. M. Gray and D. L. Neuhoff, 'Quantization,' IEEE Trans. on Information Theory, Vol. 44, pp. 2325-2383, 1998 https://doi.org/10.1109/18.720541
  5. T. Hastie, R. Tibshirani and J. Friedman, 'The Elements of Statistical Learning,' Springer, New York, pp. 496-498, 2001
  6. R. V. Hogg and E. A. Tanis, 'Probability and Statistical Inference,' 6th ed., Prentice Hall, New Jersey, pp. 85-102, 2001
  7. A. Jain, P. Moulin, M. I. Miller and K. Ramchandran, 'Information-Theoretic Bounds on Target Recognition Performance Based on Degraded Image Data,' IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 24, pp. 1153-1166, 2002 https://doi.org/10.1109/TPAMI.2002.1033209
  8. T. Kinnunen, T. Kilpelainen and P. Franti, 'Comparison of Clustering Algorithms in Speaker Identification,' in Proc. of International Conf. of Signal Processing and Communications (SPC 2000), pp. 222-227, 2000
  9. S. Kwon and S. Narayanan, 'A Study of Generic Models for Unsupervised On-Line Speaker Indexing,' in Proc. of IEEE Automatic Speech Recognition and Understanding Workshop, pp. 423-428, St. Thomas, U.S. Virgin Islands, 2003 https://doi.org/10.1109/ASRU.2003.1318478
  10. S. Kwon and S. Narayanan, 'Speaker Model Quantization for Unsupervised Speaker Indexing,' in Proc. of International Conf. on Spoken Language Processing, WeC2102p.18, Jeju, Korea, 2004
  11. S. Kwon and S. Narayanan, 'Unsupervised Speaker Indexing Using Generic Models,' IEEE Trans. on Speech and Audio Processing, Vol. 13, Issue 5, Part 2, pp. 1004-1013, 2005 https://doi.org/10.1109/TSA.2005.851981
  12. M. Liu, E. Chang and B. Q. Dai, 'Hierarchical Gaussian Mixture Model for Speaker Verification,' in Proc. of International Conf. on Spoken Language Processing, Vol. 2, pp. 1353-1356, Denver, U.S.A., 2002
  13. L. Lu, H. J. Zhang and H. Jiang, 'Content Analysis for Audio Classification and Segmentation,' IEEE Trans. on Speech and Audio Processing, Vol. 10, pp. 504-516, 2002 https://doi.org/10.1109/TSA.2002.804546
  14. M. Nishida and T. Kawahara, 'Unsupervised Speaker Indexing Using Speaker Model Selection Based on Bayesian Information Criterion,' in Proc. of IEEE International Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 172-175, Hong Kong, China, 2003
  15. J. Wu and E. Chang, 'Cohorts Based Custom Models for Rapid Speaker and Dialect Adaptation,' in Proc. of Eurospeech, pp. 1261-1264, Aalborg, Denmark, 2001
  16. T. Wu, L. Lu, K. Chen and H. Zhang, 'UBM-Based Real-Time Speaker Segmentation for Broadcasting News,' in Proc. of IEEE International Conf. on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 193-196, Hong Kong, China, 2003 https://doi.org/10.1109/ICASSP.2003.1202327
  17. J. Yang, X. Zhu, R. Gross, J. Kominek, Y. Pan and A. Waibel, 'Multimodal People ID for a Multimedia Meeting Browser,' in Proc. of 7th ACM International Conf. on Multimedia, Part 1, pp. 159-168, 1999 https://doi.org/10.1145/319463.319484