http://dx.doi.org/10.7776/ASK.2021.40.5.466

Proposal of speaker change detection system considering speaker overlap  

Park, Jisu (Department of Information and Communication Engineering, Hannam University)
Yun, Young-Sun (Department of Information and Communication Engineering, Hannam University)
Cha, Shin (Department of Information and Communication Engineering, Hannam University)
Park, Jeon Gue (ETRI)
Abstract
Speaker Change Detection (SCD) refers to finding the moment when the main speaker changes from one person to the next in a spoken conversation. Speaker change detection is difficult because of overlapping speakers, inaccurate labeling, and data imbalance. To address these problems, utterances from the TIMIT corpus, which is widely used in speech recognition, were artificially concatenated to obtain a sufficient amount of training data, and speaker changes were detected after overlapping speakers had been identified. In this paper, we propose a speaker change detection system that takes speaker overlap into account. We evaluated and verified its performance using various approaches. Based on these evaluations, a detection system similar in structure to the X-vector was adopted to remove speaker-overlap regions, and a Bi-LSTM was selected to model speaker changes. The experimental results show relative performance improvements of 4.6 % and 13.8 %, respectively, compared to the baseline system. The results also suggest that a more robust speaker change detection system could be built through follow-up studies that take text and speaker information into account.
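The artificial concatenation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `concatenate_with_overlap`, the fixed overlap length, and the choice to place the change point at the first sample of the incoming speaker are all assumptions made for illustration, and toy sample lists stand in for TIMIT audio.

```python
from typing import List, Optional, Tuple

Utterance = Tuple[str, List[float]]  # (speaker id, audio samples)


def concatenate_with_overlap(
    utterances: List[Utterance],
    overlap: int,
) -> Tuple[List[float], List[int], List[int]]:
    """Join utterances into one signal (hypothetical helper, for illustration).

    When the speaker changes, the new utterance starts `overlap` samples
    before the previous one ends, and the two signals are summed there.
    Returns the mixed signal, a per-sample overlap mask (1 = more than one
    speaker active), and the sample indices of the speaker changes.
    """
    signal: List[float] = []
    n_active: List[int] = []      # number of speakers active at each sample
    change_points: List[int] = []
    prev_speaker: Optional[str] = None

    for speaker, samples in utterances:
        ov = 0
        if prev_speaker is not None and speaker != prev_speaker:
            # Never overlap more than either side can provide.
            ov = min(overlap, len(signal), len(samples))
        start = len(signal) - ov
        if prev_speaker is not None and speaker != prev_speaker:
            # Assumption: the change is marked where the new speaker begins.
            change_points.append(start)

        for i, x in enumerate(samples):
            pos = start + i
            if pos < len(signal):
                signal[pos] += x      # mix with the previous speaker's tail
                n_active[pos] += 1
            else:
                signal.append(x)
                n_active.append(1)
        prev_speaker = speaker

    overlap_mask = [1 if n > 1 else 0 for n in n_active]
    return signal, overlap_mask, change_points


# Toy usage: two speakers, with the third utterance from the same speaker.
toy = [("spk1", [1.0] * 4), ("spk2", [2.0] * 4), ("spk2", [3.0] * 2)]
signal, overlap_mask, change_points = concatenate_with_overlap(toy, overlap=2)
```

In this toy run the overlap mask marks samples 2 and 3, and a single change point is recorded at sample 2. Labels of this kind could serve as training targets for an overlap detector and a change detector, respectively.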
Keywords
Speaker overlap detection; Speaker representation; Speaker change detection; Deep neural networks