I-vector similarity based speech segmentation for interested speaker to speaker diarization system

  • Received : 2020.06.29
  • Accepted : 2020.08.06
  • Published : 2020.09.30

Abstract

In noisy, multi-speaker environments, speech recognition performance is unavoidably lower than in a clean environment. To improve recognition, this paper extracts the signal of the speaker of interest from mixed speech containing multiple speakers. The VoiceFilter model is used to separate overlapped speech signals effectively; it requires both the mixed speech and a reference utterance spoken only by the speaker of interest. In this work, clustering by Probabilistic Linear Discriminant Analysis (PLDA) similarity score is employed to detect the speech of the interested speaker, which then serves as the reference input to VoiceFilter-based separation. By utilizing the speaker features extracted from the speech detected by the proposed clustering method, this paper proposes a speaker diarization system that uses only the mixed speech, without an explicit reference speaker signal. A telephone counseling (call-center) dataset consisting of two speakers is used to evaluate the system. The Source-to-Distortion Ratios (SDR) of the operator (Rx) and customer (Tx) speech are 5.22 dB and -5.22 dB, respectively, before separation, and rise to 11.26 dB and 8.53 dB, respectively, with the proposed separation system.
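The core of the proposed front-end is grouping speech segments by speaker-embedding similarity so that one cluster can stand in for the missing reference utterance. The sketch below illustrates that idea under simplifying assumptions: actual PLDA scoring is replaced with cosine similarity on length-normalized i-vectors, and a simple two-cluster k-means is used in place of the paper's clustering procedure; all function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def cluster_two_speakers(ivectors, n_iter=20):
    """Assign each speech segment's i-vector to one of two speaker clusters.

    Cosine similarity stands in for PLDA scoring here: length-normalizing
    the i-vectors makes Euclidean k-means equivalent to cosine clustering.
    """
    x = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    # Initialize with the first segment and the segment least similar to it,
    # a deterministic heuristic for well-separated two-speaker data.
    centers = x[[0, int(np.argmin(x @ x[0]))]].copy()
    for _ in range(n_iter):
        # Assign each segment to its nearest center, then re-estimate centers.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    return labels
```

The segments falling in the cluster of the speaker of interest would then be pooled, and a speaker embedding extracted from them plays the role of VoiceFilter's reference-speaker input.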

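The reported 5.22 dB → 11.26 dB and -5.22 dB → 8.53 dB improvements are measured with the Source-to-Distortion Ratio. A minimal sketch of a scale-invariant SDR in the spirit of the standard BSS-eval definition is shown below; this is an illustrative simplification, not the paper's exact evaluation code.

```python
import numpy as np

def sdr(reference, estimate):
    """Scale-invariant Source-to-Distortion Ratio in dB.

    Projects the estimate onto the reference to obtain the target
    component; everything orthogonal to it counts as distortion.
    """
    alpha = np.dot(reference, estimate) / np.dot(reference, reference)
    target = alpha * reference
    distortion = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))
```

A cleaner separation yields a higher SDR, so the metric directly quantifies how much interfering speech and artifact energy remains after separation.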