
Bird sounds classification by combining PNCC and robust Mel-log filter bank features

  • Badi, Alzahra (School of Electrical Engineering, Korea University) ;
  • Ko, Kyungdeuk (School of Electrical Engineering, Korea University) ;
  • Ko, Hanseok (School of Electrical Engineering, Korea University)
  • Received : 2018.11.07
  • Accepted : 2019.01.25
  • Published : 2019.01.31

Abstract

In this paper, a combined feature is proposed as a way to enhance the classification accuracy of sounds under noisy environments using a CNN (Convolutional Neural Network) structure. A robust log Mel-filter bank, obtained using a Wiener filter, and PNCCs (Power Normalized Cepstral Coefficients) are extracted to form a 2-dimensional feature that is used as input to the CNN. An ebird database is used to classify 43 bird species recorded in their natural environment. To evaluate the performance of the combined feature under noisy environments, the database is augmented with 3 types of noise at 4 different SNRs (Signal to Noise Ratios): 20 dB, 10 dB, 5 dB, and 0 dB. The combined feature is compared to the log Mel-filter bank with and without the Wiener filter, and to the PNCCs. The combined feature outperforms the other features in a clean environment, with a 1.34 % increase in overall average accuracy. Additionally, accuracy under noisy environments at the 4 SNR levels increases by 1.06 % and 0.65 % for shop and schoolyard noise backgrounds, respectively.
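As an illustration of the noise augmentation step described above, the sketch below mixes a noise recording into a clean bird call at a target SNR. This is a minimal Python sketch, not the authors' code; `clean` and `noise` are assumed to be floating-point NumPy arrays at the same sampling rate, and the paper's exact signal-level measurement may differ.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix noise into a clean recording at a target SNR in dB.

    Minimal sketch of the augmentation described in the abstract:
    each clean recording is mixed with a noise background at
    20, 10, 5 and 0 dB SNR.
    """
    # Loop the noise if it is shorter than the clean signal, then trim.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Hypothetical usage: generate the four noisy conditions.
# noisy_sets = {snr: mix_at_snr(clean, shop_noise, snr) for snr in (20, 10, 5, 0)}
```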


Fig. 1. Feature extraction for the robust log Mel-filter bank and PNCC features.
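For readers who want to reproduce something like the pipeline in Fig. 1, the sketch below extracts a log Mel-filter bank stream and a PNCC-style stream from the same Mel power spectrum and stacks them into one two-channel input. It is only an approximation under stated assumptions: the PNCC branch is reduced to Kim and Stern's power-law nonlinearity (exponent 1/15), the Wiener-filter enhancement and the rest of the PNCC pipeline (e.g., medium-time power-bias subtraction) are omitted, and all frame parameters are illustrative.

```python
import numpy as np
import librosa

def two_channel_feature(y, sr=22050, n_mels=40):
    """Simplified two-stream feature in the spirit of Fig. 1.

    Channel 0: log Mel-filter bank (log compression).
    Channel 1: PNCC-style power-law compression (exponent 1/15).
    The Wiener enhancement and full PNCC steps are omitted here.
    """
    # Mel power spectrogram, shape (n_mels, frames).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)            # log-compressed branch
    power_law = np.power(mel, 1.0 / 15.0)    # power-law branch

    # Stack as channels: shape (n_mels, frames, 2) for a 2-D CNN input.
    return np.stack([log_mel, power_law], axis=-1)
```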

Fig. 2. Power spectral density of the extracted features: (a) log Mel-filter bank of the clean signal, (b) log Mel-filter bank under shop noise (10 dB), (c) robust log Mel-filter bank under shop noise (10 dB), (d) PNCC under shop noise (10 dB), and (e) combined features under shop noise (10 dB).

Fig. 3. Convolutional neural network architectures for classification using (a) a single feature (baseline) and (b) combined features.
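A minimal Keras sketch of the combined-feature configuration in Fig. 3(b) is given below. The paper does not specify layer sizes here, so every width, kernel size, and the optimizer are illustrative assumptions; only the two-channel input and the 43-way softmax output follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(40, 128, 2), n_classes=43):
    """Illustrative CNN for a two-channel (combined) feature input."""
    model = models.Sequential([
        layers.Input(shape=input_shape),          # (mels, frames, 2 channels)
        layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation='softmax'),  # 43 bird species
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```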

Fig. 4. Confusion matrices for shop noise at an SNR of 10 dB for (a) robust log Mel-filter bank, (b) PNCC, and (c) combined features.
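Confusion matrices like those in Fig. 4 can be computed from model predictions as sketched below; `y_true` and `y_pred` are hypothetical integer species labels for a test set (e.g., shop noise at 10 dB), and the row normalization shows per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def normalized_confusion(y_true, y_pred, n_classes=43):
    """Row-normalized confusion matrix (each row sums to ~1)."""
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(n_classes))
    # Guard against empty rows before normalizing to per-class recall.
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
```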

Table 1. Bird species classification accuracy in a clean environment.

Table 2. Bird species classification accuracy in noisy environments.
