Harmonic Structure Features for Robust Speaker Diarization

Zhou, Yu; Suo, Hongbin; Li, Junfeng; Yan, Yonghong

doi:10.4218/etrij.12.0111.0455

ETRI Journal

Volume 34, Issue 4 / Pages 583-590 / 2012 / 1225-6463 (pISSN) / 2233-7326 (eISSN)

Electronics and Telecommunications Research Institute (한국전자통신연구원)

Zhou, Yu (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences) ;
Suo, Hongbin (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences) ;
Li, Junfeng (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences) ;
Yan, Yonghong (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences)

Received : 2011.07.18
Accepted : 2012.04.03
Published : 2012.08.30

https://doi.org/10.4218/etrij.12.0111.0455

Abstract

In this paper, we present a new approach to speaker diarization. First, we use prosodic information computed from the original speech to resynthesize new speech data with a spectrum modeling technique. The resynthesized data is modeled with sinusoids based on pitch, vibration amplitude, and phase bias. We then extract cepstral features from the resynthesized speech and integrate them with the cepstral features from the original speech. Finally, we show how the two streams of cepstral features can be combined to improve the robustness of speaker diarization. Experiments on a standardized dataset (the US National Institute of Standards and Technology Rich Transcription 04-S evaluation, multiple distant microphone condition) show a significant improvement in diarization error rate over a system that uses only the feature stream from the original speech.
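The resynthesis and stream-combination ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the frame hop, the per-harmonic amplitude layout, and the linear stream weight are all assumptions made for illustration, and the abstract does not state how the two feature streams are actually combined. The sketch rebuilds a signal as a sum of harmonics of a frame-level pitch track, then shows one common way to merge per-stream log-likelihoods.

```python
import math


def resynthesize_harmonics(f0_track, amp_track, sample_rate=16000, hop=160):
    """Sum-of-harmonics resynthesis from frame-level pitch and amplitudes.

    f0_track:  per-frame pitch in Hz (0.0 marks an unvoiced frame).
    amp_track: per-frame list of harmonic amplitudes (index 0 = fundamental).
    All names and defaults are illustrative assumptions, not from the paper.
    """
    n_harm = max((len(a) for a in amp_track), default=0)
    phases = [0.0] * n_harm  # running phase per harmonic, kept across frames
    out = []
    for f0, amps in zip(f0_track, amp_track):
        for _ in range(hop):
            sample = 0.0
            if f0 > 0.0:  # unvoiced frames are rendered as silence
                for k, a in enumerate(amps):
                    freq = (k + 1) * f0
                    if freq >= sample_rate / 2:  # skip harmonics above Nyquist
                        break
                    phases[k] += 2.0 * math.pi * freq / sample_rate
                    sample += a * math.sin(phases[k])
            out.append(sample)
    return out


def combine_stream_scores(ll_original, ll_resynth, weight=0.9):
    """Linear combination of two per-stream log-likelihoods.

    Assumed fusion rule: weight * original + (1 - weight) * resynthesized.
    The weight value is a placeholder, not a figure from the paper.
    """
    return weight * ll_original + (1.0 - weight) * ll_resynth


# Two 10 ms frames at 16 kHz: one voiced at 100 Hz, one unvoiced.
signal = resynthesize_harmonics([100.0, 0.0], [[1.0, 0.5], [1.0, 0.5]])
score = combine_stream_scores(-10.0, -20.0, weight=0.9)
```

In a full system, cepstral features would be extracted from both the original audio and `signal`, each stream would be scored by its own speaker model, and a rule such as `combine_stream_scores` would merge the two scores during clustering.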
