Light weight architecture for acoustic scene classification

  • Lim, Soyoung (Department of Applied Statistics, Chung-Ang University) ;
  • Kwak, Il-Youp (Department of Applied Statistics, Chung-Ang University)
  • Received : 2021.06.30
  • Accepted : 2021.09.23
  • Published : 2021.12.31

Abstract


Acoustic scene classification (ASC) categorizes an audio file by the environment in which it was recorded. The task has long been studied in the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. In this study, we addressed a constraint that ASC faces in real-world applications: the model should have low complexity, and in particular, deployment on light-weight devices requires a light-weight deep learning model. We compared several models that apply light-weight techniques. First, we proposed a base CNN model that uses log mel-spectrogram, deltas, and delta-deltas features. We then replaced the ordinary convolutional layers with efficient convolution blocks such as depthwise separable convolution blocks and linear bottleneck inverted residual blocks, and applied quantization to each model to obtain low-complexity variants. The low-complexity models performed similarly to or slightly below the base model, but the model size was significantly reduced, from 503 KB to 42.76 KB.
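The two efficient convolution blocks the abstract names can be sketched as follows. This is a minimal PyTorch illustration of the general techniques (as in Chollet, 2017 and Sandler et al., 2018), not the authors' exact architecture; the channel counts, expansion factor, and the 3-channel input shape (log-mel, deltas, delta-deltas stacked as channels) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """Depthwise separable convolution: a per-channel 3x3 spatial
    convolution (groups=in_ch) followed by a 1x1 pointwise convolution,
    replacing one standard convolution at a fraction of the parameters."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class InvertedResidualBlock(nn.Module):
    """Linear bottleneck inverted residual: expand channels with a 1x1
    convolution, apply a depthwise 3x3 convolution, then project back
    with a linear (no activation) 1x1 convolution and add the input."""
    def __init__(self, ch, expansion=4):
        super().__init__()
        hidden = ch * expansion
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, ch, 1, bias=False),  # linear projection
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection

# Input shaped like a stacked spectrogram feature:
# (batch, 3 channels = log-mel/deltas/delta-deltas, mel bins, frames)
x = torch.randn(1, 3, 40, 128)
y = DepthwiseSeparableBlock(3, 16)(x)   # -> (1, 16, 40, 128)
z = InvertedResidualBlock(16)(y)        # shape preserved
```

Post-training quantization (e.g. converting weights to 8-bit integers with a toolkit such as TensorFlow Lite or `torch.quantization`) would then be applied to a trained model of this kind to shrink its on-disk size further.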

Keywords

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1C1C1A01013020).

References

  1. Chollet F (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 1251-1258.
  2. Courbariaux M, Hubara I, Soudry D, El-Yaniv R, and Bengio Y (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1, arXiv preprint arXiv:1602.02830.
  3. Han S, Mao H, and Dally WJ (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 4th International Conference on Learning Representations (ICLR 2016).
  4. He K, Zhang X, Ren S, and Sun J (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770-778.
  5. Heittola T, Mesaros A, and Virtanen T (2020). Acoustic scene classification in DCASE 2020 challenge: generalization across devices and low complexity solutions. In Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020), 56-60.
  6. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, and Adam H (2017). Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv preprint arXiv:1704.04861.
  7. Hu H, Yang CHH, Xia X et al. (2020). Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation, arXiv preprint arXiv:2007.08389.
  8. Jan MA, Zakarya M, Khan M, Mastorakis S, Menon VG, Balasubramanian V, and Rehman AU (2021). An AI-enabled lightweight data fusion and load optimization approach for Internet of Things, Future Generation Computer Systems, 122, 40-51. https://doi.org/10.1016/j.future.2021.03.020
  9. Kingma DP and Ba J (2015). Adam: A method for stochastic optimization, 3rd International Conference on Learning Representations (ICLR 2015).
  10. Koutini K, Eghbal-Zadeh H, Dorfer M, and Widmer G (2019). The receptive field as a regularizer in deep convolutional neural networks for acoustic scene classification, 27th European Signal Processing Conference (EUSIPCO 2019), 1-5.
  11. Koutini K, Henkel F, Eghbal-zadeh H, and Widmer G (2020). CP-JKU submissions to DCASE'20: Low-complexity cross-device acoustic scene classification with RF-regularized CNNs, DCASE2020 Challenge Technical Report.
  12. Lee YJ, Moon YH, Park JY, and Min OG (2019). Recent R&D trends for lightweight deep learning, Electronics and Telecommunications Trends, 34, 40-50.
  13. McDonnell M (2020). Low-complexity acoustic scene classification using one-bit-per-weight deep convolutional neural networks, DCASE2020 Challenge Technical Report.
  14. McDonnell MD and Gao W (2020). Acoustic scene classification using deep residual networks with late fusion of separated high and low frequency paths, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), 141-145.
  15. Mesaros A, Heittola T, and Virtanen T (2018). A multi-device dataset for urban acoustic scene classification, arXiv preprint arXiv:1807.09840.
  16. Sandler M, Howard A, Zhu M, Zhmoginov A, and Chen LC (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR 2018), 4510-4520.
  17. Strubell E, Ganesh A, and McCallum A (2019). Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645-3650.
  18. Suh S, Lim W, Jeong Y, Lee T, and Kim HY (2018). Dual CNN structured sound event detection algorithm based on real life acoustic dataset, The Korean Institute of Broadcast and Media Engineers, 23, 855-865.
  19. Suh S, Park S, Jeong Y, and Lee T (2020). Designing acoustic scene classification models with CNN variants, DCASE2020 Challenge Technical Report.
  20. Szegedy C, Liu W, Jia Y, et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR 2015), 1-9.
  21. Xiong Y, Kim HWJ, and Hedau V (2019). Antnets: Mobile Convolutional Neural Networks for Resource Efficient Image Classification, arXiv preprint arXiv:1904.03775.
  22. Zhang H, Cisse M, Dauphin YN, and Lopez-Paz D (2018). Mixup: Beyond empirical risk minimization, 6th International Conference on Learning Representations (ICLR 2018).