Acknowledgement
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (NRF-2021R1F1A1063347).
References
- D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, "Speaker diarization using deep neural network embeddings," Proc. ICASSP, 4930-4934 (2017).
- Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," Proc. ICASSP, 5239-5243 (2018).
- M. Diez, L. Burget, S. Wang, J. Rohdin, and H. Černocký, "Bayesian HMM based x-vector clustering for speaker diarization," Proc. Interspeech, 346-350 (2019).
- I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, and A. Romanenko, "Target-speaker voice activity detection: a novel approach for multispeaker diarization in a dinner party scenario," Proc. Interspeech, 274-278 (2020).
- Y. C. Liu, E. Han, C. Lee, and A. Stolcke, "End-to-end neural diarization: From transformer to conformer," Proc. Interspeech, 3081-3085 (2021).
- Z. Du, S. Zhang, S. Zheng, and Z. Yan, "Speaker embedding-aware neural diarization: A novel framework for overlapping speech diarization in the meeting scenario," arXiv preprint arXiv:2203.09767 (2022).
- Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, "End-to-end neural speaker diarization with permutation-free objectives," Proc. Interspeech, 4300-4304 (2019).
- Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-end neural speaker diarization with self-attention," Proc. ASRU, 296-303 (2019).
- Y. Yu, D. Park, and H. K. Kim, "Auxiliary loss of transformer with residual connection for end-to-end speaker diarization," Proc. ICASSP, 8377-8381 (2022).
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," Proc. ICASSP, 5206-5210 (2015).
- D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484 (2015).
- T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," Proc. ICASSP, 5220-5224 (2017).
- J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus," Lang. Resour. Eval. 41, 181-190 (2007). https://doi.org/10.1007/s10579-007-9040-x
- A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, "The ICSI meeting corpus," Proc. ICASSP, 364-367 (2003).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NIPS, 5998-6008 (2017).
- J. G. Fiscus, J. Ajot, and J. S. Garofolo, The Rich Transcription 2007 Meeting Recognition Evaluation (Springer, Maryland, 2007), pp. 373-389.
- H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M. P. Gill, "Pyannote.audio: Neural building blocks for speaker diarization," Proc. ICASSP, 7124-7128 (2020).
- H. Bredin and A. Laurent, "End-to-end speaker segmentation for overlap-aware resegmentation," Proc. Interspeech, 3111-3115 (2021).