Framework Switching of Speaker Overlap Detection System

  • 김회남 (Department of Information and Communication Engineering, Hannam University) ;
  • 박지수 (Department of Information and Communication Engineering, Hannam University) ;
  • 차신 (Department of Information and Communication Engineering, Hannam University) ;
  • 손경아 (U Education Innovation Center, Ulsan National Institute of Science and Technology (UNIST)) ;
  • 윤영선 (Department of Information and Communication Engineering, Hannam University) ;
  • 박전규 (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI))
  • Received : 2021.05.29
  • Accepted : 2021.06.20
  • Published : 2021.06.30

Abstract

In this paper, we introduce a speaker overlap detection system and examine the process of converting a system already built on one artificial intelligence framework to another. Speaker overlap occurs when two or more speakers speak at the same time during a conversation; detecting it in advance can prevent performance degradation in speech recognition and speaker recognition, so it has been studied extensively. As applications of artificial intelligence become more widespread, switching between artificial intelligence frameworks is increasingly required. However, performance degradation is often observed after such a switch because of the particular characteristics of each framework, which makes framework conversion difficult. This paper describes the process of converting a Keras-based speaker overlap detection system to a PyTorch-based system and summarizes the components that must be considered in the conversion. After the conversion, the PyTorch-based system outperformed the existing Keras-based system, so this work has value as a foundational study on systematic framework conversion.
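To make the kind of conversion described above concrete, the following is a minimal sketch of re-expressing the same model in Keras and in PyTorch. The two-layer frame-level overlap classifier, the feature dimension N_FEATS, and the hyperparameters are illustrative assumptions, not the architecture reported in the paper; the point is only that the model and training setup must be rewritten with each framework's own APIs, whose defaults differ.

    # A minimal, hypothetical sketch of a Keras-to-PyTorch port.
    # The two-layer classifier, N_FEATS, and the hyperparameters are
    # illustrative assumptions, not the system reported in the paper.
    import torch
    import torch.nn as nn
    from tensorflow import keras

    N_FEATS = 40  # assumed per-frame acoustic feature dimension

    # Original Keras definition of the (hypothetical) overlap classifier.
    keras_model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(N_FEATS,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    keras_model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                        loss="binary_crossentropy")

    # Equivalent PyTorch definition. Framework defaults such as weight
    # initialization and the Adam epsilon differ, which is one possible
    # source of the performance gap observed when switching frameworks.
    torch_model = nn.Sequential(
        nn.Linear(N_FEATS, 128),
        nn.ReLU(),
        nn.Linear(128, 1),
        nn.Sigmoid(),
    )
    optimizer = torch.optim.Adam(torch_model.parameters(), lr=1e-3)
    criterion = nn.BCELoss()

Even with matched layer shapes, reproducing the original results in the target framework generally also requires matching details such as loss reduction, batching, and the learning-rate schedule.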

Acknowledgement

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) in 2019 (No. 2019-0-01376, Development of multi-speaker conversational speech recognition technology).
