Audio Event Detection Based on Attention CRNN

Kwak, Jin-Yeol; Chung, Yong-Joo

doi:10.13067/JKIECS.2020.15.3.465

The Journal of the Korea Institute of Electronic Communication Sciences (한국전자통신학회논문지)

Volume 15, Issue 3, Pages 465-472, 2020, pISSN 1975-8170

Korea Institute of Electronic Communication Science (한국전자통신학회)

Kwak, Jin-Yeol (Dept. of Electrical and Electronic Convergence System Engineering, Keimyung University);
Chung, Yong-Joo (Dept. of Electronic Engineering, Keimyung University)

Received : 2020.03.31
Accepted : 2020.06.15
Published : 2020.06.30

https://doi.org/10.13067/JKIECS.2020.15.3.465

Abstract

Various deep neural network based methods have recently been proposed for audio event detection. In this study, we improve audio event detection performance by introducing an attention mechanism into a baseline convolutional recurrent neural network (CRNN). We apply context gating at the input of the baseline CRNN and add an attention layer at its output. We further improve the attention-based CRNN by training it on audio data with strong, frame-level labels in addition to data with weak, clip-level labels. In audio event detection experiments on the data from Task 4 of the DCASE 2018/2019 Challenge, the proposed attention-based CRNN achieved a relative F-score improvement of up to 66% over the baseline CRNN.
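The abstract names three ideas that a short sketch may make more concrete: context gating at the input, attention pooling at the output, and a relative F-score improvement. The Python below is only an illustration under stated assumptions, not the authors' implementation. It assumes the common form of context gating (each feature scaled by a sigmoid of a linear function of the input) and attention pooling as a weighted average of frame-level probabilities. All function names, weights, and numeric values are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def context_gate(x, weights, bias):
    """Assumed form of context gating: y_i = x_i * sigmoid(W_i . x + b_i).

    x       : list of input features for one frame
    weights : square matrix (list of rows), one row per feature
    bias    : list of per-feature biases
    """
    gated = []
    for i, x_i in enumerate(x):
        pre_activation = sum(w * x_j for w, x_j in zip(weights[i], x)) + bias[i]
        gated.append(x_i * sigmoid(pre_activation))
    return gated


def attention_pool(frame_probs, attention_scores):
    """Clip-level probability as an attention-weighted mean over frames.

    attention_scores are assumed non-negative with a positive sum,
    for example the outputs of a sigmoid or softmax layer.
    """
    total = sum(attention_scores)
    weighted = sum(a * p for a, p in zip(attention_scores, frame_probs))
    return weighted / total


def relative_gain(baseline_f, proposed_f):
    """Relative improvement of proposed_f over baseline_f."""
    return (proposed_f - baseline_f) / baseline_f


# Hypothetical values for illustration only (not results from the paper).
gated = context_gate([2.0, -1.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
clip_prob = attention_pool([0.9, 0.1, 0.1], [2.0, 1.0, 1.0])
gain = relative_gain(0.5, 0.83)
```

With zero weights and biases, every gate equals 0.5, so each feature is simply halved. In the pooling step, the first frame carries twice the attention of the others, so it dominates the clip-level score. Training with weak labels would compare the pooled clip-level probability against the clip label, while strong labels allow a loss directly on the frame-level probabilities.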
