Audio Event Detection Using Deep Neural Networks

  • Lim, Minkyu (Department of Computer Science and Engineering, Sogang University) ;
  • Lee, Donghyun (Department of Computer Science and Engineering, Sogang University) ;
  • Park, Hosung (Department of Computer Science and Engineering, Sogang University) ;
  • Kim, Ji-Hwan (Department of Computer Science and Engineering, Sogang University)
  • Received : 2017.01.19
  • Accepted : 2017.02.25
  • Published : 2017.02.28

Abstract

This paper proposes an audio event detection method using Deep Neural Networks (DNN). The proposed method applies a Feed-Forward Neural Network (FFNN) to generate output probabilities for twenty audio events at each frame. Mel-scale filter bank (FBANK) features are extracted from every frame, and the features of five consecutive frames, centered on the current frame, are concatenated into a single vector that serves as the FFNN input. The output layer of the FFNN produces the audio event probabilities for each input feature vector. A segment is detected as an audio event when its event probability exceeds a threshold for five or more consecutive frames, and a detected event is kept as a single event as long as the same event is detected again within one second. The proposed method achieves 71.8% accuracy on 20 classes drawn from the UrbanSound8K and BBC Sound FX datasets.
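
As a rough illustration of the frame-level pipeline described above, the following Python sketch extracts log mel-scale filter bank (FBANK) features and stacks five consecutive frames into one FFNN input vector. It is a minimal sketch, not the authors' implementation: the 16 kHz sampling rate, 40 mel bands, 25 ms window, 10 ms hop, the librosa-based extraction, and the layer sizes of the illustrative Keras model are all assumptions.

    import numpy as np
    import librosa
    import tensorflow as tf

    def fbank_features(wav_path, sr=16000, n_mels=40, win=0.025, hop=0.010):
        """Log mel-scale filter bank (FBANK) features, one row per frame."""
        y, sr = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=int(win * sr),
            hop_length=int(hop * sr), n_mels=n_mels)
        return np.log(mel + 1e-10).T  # shape: (n_frames, n_mels)

    def stack_context(feats, context=5):
        """Concatenate `context` consecutive frames (centered) into one vector per frame."""
        pad = context // 2
        padded = np.pad(feats, ((pad, pad), (0, 0)), mode="edge")
        return np.hstack([padded[i:i + len(feats)] for i in range(context)])

    # Hypothetical FFNN: two hidden layers, softmax over the 20 audio events.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(5 * 40,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(20, activation="softmax"),
    ])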

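The detection rule in the abstract (an event fires when its probability stays above a threshold for five or more consecutive frames, and same-event detections less than one second apart are merged) can be sketched as post-processing over the per-frame FFNN outputs. The 0.5 threshold and the 10 ms frame hop below are assumptions.

    import numpy as np

    def detect_events(probs, threshold=0.5, min_frames=5, merge_gap_s=1.0, hop_s=0.010):
        """probs: (n_frames, n_events) per-frame event probabilities.
        Returns (event_id, start_s, end_s) tuples."""
        events = []
        for e in range(probs.shape[1]):
            active = probs[:, e] > threshold
            runs, start = [], None
            for t, a in enumerate(active):
                if a and start is None:
                    start = t
                elif not a and start is not None:
                    if t - start >= min_frames:   # keep runs of >= 5 frames
                        runs.append([start, t])
                    start = None
            if start is not None and len(active) - start >= min_frames:
                runs.append([start, len(active)])
            merged = []                           # merge same-event runs < 1 s apart
            for s, t in runs:
                if merged and (s - merged[-1][1]) * hop_s < merge_gap_s:
                    merged[-1][1] = t
                else:
                    merged.append([s, t])
            events += [(e, s * hop_s, t * hop_s) for s, t in merged]
        return events

With a trained model, something like detect_events(model.predict(stack_context(fbank_features("clip.wav")))) would then return (event, start, end) tuples in seconds for a hypothetical clip.wav, under the assumptions above.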

References

  1. K. Kim and H. Kim, "Scaling learning algorithms towards AI," Journal of Digital Content Society, Vol. 14, No.4, pp.481-491, December, 2013. https://doi.org/10.9728/dcs.2013.14.4.481
  2. L. Lu, H. Jiang, and H. Zhang, "A robust audio classification and segmentation method," in Proceedings of the ACM International Conference on Multimedia, Ottawa, pp.203-211, 2001.
  3. M. Xu, N. Maddage, C. Xu, M. Kankanhalli, and Q. Tian, "Creating audio keywords for event detection in soccer video," in Proceedings of the IEEE International Conference on Multimedia and Expo, Baltimore: MD, pp.281-284, 2003.
  4. W. Cheng, W. Chu, and J. Wu, "Semantic context detection based on hierarchical audio models," in Proceedings of the ACM SIGMM International Workshop on Multimedia Information Retrieval, Berkeley: CA, pp.109-115, 2003.
  5. H. Lee, P. Pham, Y. Largman, and A. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Proceedings of Advances in Neural Information Processing Systems, Vancouver, pp.1096-1104, 2009.
  6. Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," in Large-Scale Kernel Machines, MIT Press, pp.321-360, 2007.
  7. J. Portelo, M. Bugalho, I. Trancoso, J. Neto, A. Abad, and A. Serralheiro, "Non-speech audio event detection," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Taipei, pp.1973-1976, 2009.
  8. L. Ballan, A. Bazzica, M. Bertini, A. Del Bimbo, and G. Serra, "Deep networks for audio event classification in soccer videos," in Proceedings of the International Conference on Multimedia and Expo, Cancun, pp.474-477, 2009.
  9. T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Context-dependent sound event detection," EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2013, No. 1, pp.1-13, January, 2013.
  10. Z. Kons and O. Toledo-Ronen, "Audio event classification using deep neural networks," in Proceedings of INTERSPEECH, Lyon, pp.1482-1486, 2013.
  11. M. Lim and J. Kim, "Audio Event Classification Using Deep Neural Networks," Phonetics and Speech Sciences, Vol. 7, No. 4, pp.27-33, January, 2015. https://doi.org/10.13064/KSSS.2015.7.4.027
  12. H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proceedings of the International Conference on Machine Learning, Corvallis: OR, pp.473-480, 2007.
  13. G. Dahl, T. Sainath, and G. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Vancouver, pp.8609-8613, 2013.
  14. L. Bottou, Advanced Lectures on Machine Learning, Springer, pp. 146-168, 2004.
  15. J. Salamon, C. Jacoby, and J. Bello, "A dataset and taxonomy for urban sound research," in Proceedings of the ACM International Conference on Multimedia, Orlando: FL, pp.1041-1044, 2014.
  16. M. Slaney, "Semantic-audio retrieval," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Orlando: FL, pp.1408-1411, 2002.
  17. S. Young, G. Evermann, M. Gales, and P. Woodland, The HTK book (for HTK version 3.4), Cambridge, U.K.: Entropic, 2006.
  18. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous distributed systems, Available: https://www.tensorflow.org/
