Comparison of environmental sound classification performance of convolutional neural networks according to audio preprocessing methods


  • Oh, Wongeun (Department of Multimedia Engineering, Sunchon National University)
  • Received : 2020.02.25
  • Accepted : 2020.04.22
  • Published : 2020.05.31

Abstract

This paper examines the effect of the feature extraction method used in audio preprocessing on the classification performance of Convolutional Neural Networks (CNNs). We extract the mel spectrogram, log mel spectrogram, Mel-Frequency Cepstral Coefficients (MFCC), and delta MFCC from the UrbanSound8K dataset, which is widely used in environmental sound classification studies, and scale each feature to three distributions. Using these data, we train four CNNs together with the VGG16 and MobileNetV2 networks, which performed well on ImageNet, and measure the recognition rate for each audio feature and scaling method. The highest recognition rate is achieved when the unscaled log mel spectrogram is used as the audio feature. Although this result cannot be generalized to all audio recognition problems, it should be useful for classifying audio containing the environmental sounds in UrbanSound8K.
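As a rough illustration of the preprocessing step, the four features named above can be extracted with the librosa library. The file path, frame parameters, and the three scalings shown are assumptions for this sketch; the abstract does not specify the paper's exact settings or distributions.

```python
import librosa
import numpy as np

# Load one UrbanSound8K clip (the path is illustrative).
y, sr = librosa.load("UrbanSound8K/audio/fold1/example.wav", sr=22050)

# Mel spectrogram and its log-scaled (dB) version.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)

# MFCC and delta MFCC.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
delta_mfcc = librosa.feature.delta(mfcc)

# Three candidate scalings (assumed; the paper's actual three
# distributions are not given in the abstract): min-max to [0, 1],
# zero-mean/unit-variance standardization, and min-max to [-1, 1].
def minmax01(x):
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    return (x - np.mean(x)) / np.std(x)

def minmax_pm1(x):
    return 2.0 * minmax01(x) - 1.0
```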

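Likewise, a minimal sketch of how an ImageNet architecture such as MobileNetV2 can be adapted to spectrogram input with tf.keras; the input shape, classification head, and training settings here are assumptions, not the paper's reported configuration.

```python
import tensorflow as tf

NUM_CLASSES = 10  # UrbanSound8K contains 10 environmental sound classes

# ImageNet backbone without its 1000-class head, trained from scratch.
base = tf.keras.applications.MobileNetV2(
    include_top=False,
    weights=None,
    input_shape=(128, 128, 3),  # assumed spectrogram "image" size
)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

VGG16 can be swapped in the same way via tf.keras.applications.VGG16.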
