http://dx.doi.org/10.9728/dcs.2017.18.1.183

Audio Event Detection Using Deep Neural Networks  

Lim, Minkyu (Department of Computer Science and Engineering, Sogang University)
Lee, Donghyun (Department of Computer Science and Engineering, Sogang University)
Park, Hosung (Department of Computer Science and Engineering, Sogang University)
Kim, Ji-Hwan (Department of Computer Science and Engineering, Sogang University)
Publication Information
Journal of Digital Contents Society / v.18, no.1, 2017, pp. 183-190
Abstract
This paper proposes an audio event detection method using deep neural networks (DNNs). The method applies a feed-forward neural network (FFNN) to generate output probabilities for twenty audio events at each frame. Mel-scale filter bank (FBANK) features are extracted from each frame, and five consecutive frames are concatenated into a single vector that serves as the input feature of the FFNN. The output layer of the FFNN produces audio event probabilities for each input feature vector. A run of more than five consecutive frames whose event probability exceeds a threshold is detected as an audio event. A detected event is treated as continuing as long as the same event is detected again within one second. The proposed method achieves 71.8% accuracy on 20 classes drawn from the UrbanSound8K and BBC Sound FX datasets.
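The frame stacking and the detection rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `stack_frames` and `detect_events`, the threshold of 0.5, and the frame rate of 100 frames per second (a 10 ms hop) are assumptions that the abstract does not state. The one-second rule is read here as "detections of the same event less than one second apart are merged into one event", which is also an interpretation.

```python
import numpy as np


def stack_frames(fbank, context=5):
    """Concatenate `context` consecutive FBANK frames into one input vector.

    fbank: array of shape (n_frames, n_bins).
    Returns an array of shape (n_frames - context + 1, context * n_bins).
    """
    n_frames, n_bins = fbank.shape
    if n_frames < context:
        return np.empty((0, context * n_bins))
    windows = [fbank[i:i + context].reshape(-1)
               for i in range(n_frames - context + 1)]
    return np.stack(windows)


def detect_events(probs, threshold=0.5, min_frames=6, max_gap_frames=100):
    """Turn per-frame probabilities of ONE event class into event segments.

    probs: 1-D array of per-frame probabilities from the network output.
    threshold: assumed value; the abstract does not give the actual one.
    min_frames: 6, i.e. "more than five consecutive frames".
    max_gap_frames: 100, i.e. one second at an assumed 10 ms frame hop.
    Returns a list of (start, end) frame indices, with `end` exclusive.
    """
    active = np.asarray(probs) > threshold

    # Step 1: collect runs of above-threshold frames that are long enough.
    runs = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            if i - start >= min_frames:
                runs.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_frames:
        runs.append((start, len(active)))

    # Step 2: merge runs separated by at most one second into one event.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= max_gap_frames:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```

In a multi-class setting, `detect_events` would be called once per column of the network's output matrix, so each of the twenty event classes gets its own list of segments.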
Keywords
Audio event detection; Deep neural network; Feed forward neural network