Performance of music section detection in broadcast drama contents using independent component analysis and deep neural networks

  • 허운행 (Department of Control and Robot Engineering, Graduate School, Chungbuk National University) ;
  • 장병용 (Department of Control and Robot Engineering, Graduate School, Chungbuk National University) ;
  • 조현호 (Department of Control and Robot Engineering, Graduate School, Chungbuk National University) ;
  • 김정현 (Electronics and Telecommunications Research Institute) ;
  • 권오욱 (Chungbuk National University)
  • Received : 2018.08.08
  • Accepted : 2018.09.27
  • Published : 2018.09.30

Abstract

We propose to use independent component analysis (ICA) and a deep neural network (DNN) to detect music sections in broadcast drama contents. Drama contents mainly comprise silence, noise, speech, music, and mixed (speech + music) sections. Silence sections are detected by signal activity detection. To detect music sections, we train DNN models for the noise, speech, music, and mixed classes. In computer experiments, we trained the acoustic model on the MUSAN corpus and evaluated it on three hours of Korean drama contents. Because mixed sections include music signals, they were regarded as music sections. The segmentation error rate (SER) of music section detection was 19.0%. When the stereo mixed signals were additionally separated into music signals using ICA, the SER was reduced to 11.8%.
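As a rough illustration of the two quantitative steps in the abstract, the sketch below separates a stereo mixture into two independent components with ICA and scores a segmentation error rate. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, scikit-learn's FastICA is assumed as the ICA algorithm, and SER is simplified to a frame-level mislabeling rate rather than a time-weighted segmentation score.

```python
# Minimal sketch of ICA-based stereo separation and SER scoring.
# NOT the paper's implementation: function names, the choice of
# scikit-learn's FastICA, and the frame-level SER definition are
# illustrative assumptions.
import numpy as np
from sklearn.decomposition import FastICA

def separate_stereo(stereo):
    """Split a stereo signal of shape (n_samples, 2) into two
    statistically independent components. For drama audio in which
    speech and music are mixed across the two channels, one component
    is expected to be music-dominant."""
    ica = FastICA(n_components=2, random_state=0)
    sources = ica.fit_transform(stereo)  # shape (n_samples, 2)
    return sources[:, 0], sources[:, 1]

def segmentation_error_rate(reference, hypothesis):
    """Frame-level SER: fraction of frames whose hypothesized class
    label (e.g., 0 = non-music, 1 = music) differs from the reference."""
    reference = np.asarray(reference)
    hypothesis = np.asarray(hypothesis)
    return float(np.mean(reference != hypothesis))

if __name__ == "__main__":
    # Toy usage: a synthetic stereo mixture of two stand-in sources.
    t = np.linspace(0, 1, 16000)
    speech = np.sign(np.sin(2 * np.pi * 5 * t))   # stand-in "speech"
    music = np.sin(2 * np.pi * 440 * t)           # stand-in "music"
    mixing = np.array([[0.7, 0.3], [0.4, 0.6]])   # channel mixing matrix
    stereo = np.column_stack([speech, music]) @ mixing.T
    s1, s2 = separate_stereo(stereo)
    print(segmentation_error_rate([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.25
```

Deciding which separated component is the music-dominant one is left open here; in a full pipeline that choice would fall to the DNN classifier or a simple energy or spectral heuristic.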
