A study on training DenseNet-Recurrent Neural Network for sound event detection

  • 차현진 (Department of Electronic Engineering, Gangneung-Wonju National University);
  • 박상욱 (Department of Electronic Engineering, Gangneung-Wonju National University)
  • Received : 2023.06.29
  • Accepted : 2023.09.01
  • Published : 2023.09.30

Abstract

Sound Event Detection (SED) aims to identify not only the category but also the time interval of target sounds in an audio waveform. It is a critical technique for acoustic surveillance and monitoring systems. Recently, various models have been introduced through the Detection and Classification of Acoustic Scenes and Events (DCASE) Task 4 challenge. This paper explores how to choose optimal design parameters for a DenseNet-based model, an architecture that has achieved outstanding performance in other recognition tasks. In the experiments, the SED model, DenseRNN, consists of DenseNet-BC and bidirectional Gated Recurrent Units (GRUs), and is trained with the mean teacher method. Under the DCASE Task 4 assessment protocol, performance is evaluated with the event-based f-score while varying parameters related to both the model architecture and the training procedure. Experimental results show that performance improves as model complexity grows but saturates beyond a certain point. In addition, DenseRNN trains more effectively when dropout is not applied.
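The mean teacher scheme mentioned above maintains a second ("teacher") copy of the network whose weights are an exponential moving average (EMA) of the student's weights; only the student receives gradient updates, and consistency between the two models' predictions supplies the semi-supervised training signal. A minimal sketch of the weight update in plain Python (the list-of-scalars representation and the default smoothing factor `alpha=0.999` are illustrative assumptions, not the paper's settings):

```python
# Illustrative mean teacher weight update: the teacher's parameters are
# an exponential moving average of the student's parameters. The teacher
# is never trained by gradient descent; it only tracks the student.
def ema_update(teacher_weights, student_weights, alpha=0.999):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_weights, student_weights)]

# Toy example with one scalar "weight" per layer and an exaggerated
# smoothing factor so the effect is visible after a single step.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
teacher = ema_update(teacher, student, alpha=0.5)
print(teacher)  # → [0.5, 1.0]
```

In practice the same update is applied to every parameter tensor after each optimizer step, so the teacher lags behind the student as a smoothed ensemble of its recent states.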

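The event-based f-score used for evaluation credits a predicted event only when its label and timing match a reference event within a tolerance. A simplified plain-Python sketch of the idea (this checks only an onset collar; the official DCASE metric, computed with the sed_eval toolbox, typically uses a 200 ms onset collar and also constrains the offset):

```python
def event_f_score(refs, preds, collar=0.2):
    """Simplified event-based F-score.

    refs / preds: lists of (label, onset_sec, offset_sec) tuples.
    A prediction counts as a true positive if some still-unmatched
    reference event has the same label and an onset within `collar`
    seconds. (The official metric applies an offset condition as well.)
    """
    matched = set()
    tp = 0
    for label, onset, _offset in preds:
        for i, (r_label, r_onset, _r_offset) in enumerate(refs):
            if (i not in matched and label == r_label
                    and abs(onset - r_onset) <= collar):
                matched.add(i)
                tp += 1
                break
    fp, fn = len(preds) - tp, len(refs) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if refs else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

For example, a "dog" prediction whose onset is 0.1 s away from the reference onset is matched under the 0.2 s collar, while a prediction several seconds off is counted as both a false positive and a missed reference.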

Acknowledgement

This paper was supported by the 2022 Gangneung-Wonju National University new faculty research fund and is a result of the regional innovation project based on local government-university cooperation, supported by the National Research Foundation of Korea and funded by the Ministry of Education in 2023 (2022RIS-005).
