Framework Switching of Speaker Overlap Detection System

  • 김회남 (Department of Information and Communication Engineering, Hannam University) ;
  • 박지수 (Department of Information and Communication Engineering, Hannam University) ;
  • 차신 (Department of Information and Communication Engineering, Hannam University) ;
  • 손경아 (U Education Innovation Center, Ulsan National Institute of Science and Technology (UNIST)) ;
  • 윤영선 (Department of Information and Communication Engineering, Hannam University) ;
  • 박전규 (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI))
  • Received : 2021.05.29
  • Accepted : 2021.06.20
  • Published : 2021.06.30

Abstract

In this paper, we introduce a speaker overlap detection system and examine the process of converting a system already built on one artificial intelligence framework to another. Speaker overlap occurs when two or more speakers speak at the same time during a conversation; detecting it in advance can prevent performance degradation in speech recognition and speaker recognition, so it has been studied extensively. As applications of artificial intelligence become more widespread, switching between artificial intelligence frameworks is increasingly required. However, performance degradation is often observed after such a switch because of the particular characteristics of each framework, which makes framework conversion difficult. This paper describes the process of converting a Keras-based speaker overlap detection system to a PyTorch-based system and summarizes the components that must be considered in the conversion. After the conversion, the PyTorch-based system outperformed the existing Keras-based system, so this work has value as a foundational study on systematic framework conversion.
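To make the kind of conversion described above concrete, the following is a minimal sketch of re-expressing the same model in Keras and in PyTorch. The two-layer frame-level overlap classifier, the feature dimension N_FEATS, and the hyperparameters are illustrative assumptions, not the architecture reported in the paper; the point is only that the model and training setup must be rewritten with each framework's own APIs, whose defaults differ.

    # A minimal, hypothetical sketch of a Keras-to-PyTorch port.
    # The two-layer classifier, N_FEATS, and the hyperparameters are
    # illustrative assumptions, not the system reported in the paper.
    import torch
    import torch.nn as nn
    from tensorflow import keras

    N_FEATS = 40  # assumed per-frame acoustic feature dimension

    # Original Keras definition of the (hypothetical) overlap classifier.
    keras_model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(N_FEATS,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    keras_model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                        loss="binary_crossentropy")

    # Equivalent PyTorch definition. Framework defaults such as weight
    # initialization and the Adam epsilon differ, which is one possible
    # source of the performance gap observed when switching frameworks.
    torch_model = nn.Sequential(
        nn.Linear(N_FEATS, 128),
        nn.ReLU(),
        nn.Linear(128, 1),
        nn.Sigmoid(),
    )
    optimizer = torch.optim.Adam(torch_model.parameters(), lr=1e-3)
    criterion = nn.BCELoss()

Even with matched layer shapes, reproducing the original results in the target framework generally also requires matching details such as loss reduction, batching, and the learning-rate schedule.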

Acknowledgement

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) in 2019 (No. 2019-0-01376, Development of multi-speaker conversational speech recognition technology).
