[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.13064/KSSS.2018.10.3.019

Performance of music section detection in broadcast drama contents using independent component analysis and deep neural networks

Heo, Woon-Haeng (충북대학교 제어로봇공학전공 대학원)
Jang, Byeong-Yong (충북대학교 제어로봇공학전공 대학원)
Jo, Hyeon-Ho (충북대학교 제어로봇공학전공 대학원)
Kim, Jung-Hyun (한국전자통신연구원)
Kwon, Oh-Wook (충북대학교)

Publication Information

Phonetics and Speech Sciences / v.10, no.3, 2018 , pp. 19-29 More about this Journal

Abstract

We propose to use independent component analysis (ICA) and deep neural network (DNN) to detect music sections in broadcast drama contents. Drama contents mainly comprise silence, noise, speech, music, and mixed (speech+music) sections. The silence section is detected by signal activity detection. To detect the music section, we train noise, speech, music, and mixed models with DNN. In computer experiments, we used the MUSAN corpus for training the acoustic model, and conducted an experiment using 3 hours' worth of Korean drama contents. As the mixed section includes music signals, it was regarded as a music section. The segmentation error rate (SER) of music section detection was observed to be 19.0%. In addition, when stereo mixed signals were separated into music signals using ICA, the SER was reduced to 11.8%.

Keywords

independent component analysis; deep neural network; segmentation error rate;

Citations & Related Records

Reference

1	Aguilo, M., Butko, T., Temko, A., & Nadeu, C. (2009). A hierarchical architecture for audio segmentation in a broadcast news task. Proceedings of I Iberian SLTech (pp. 17-20).
2	Boersma, P., & Weenink, D. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9-10), 341-345.
3	Castan, D., Tavarez, D., Lopez-Otero, P., Franco-Pedroso, J., Delgado, H., Navas, E., Docio-Fernandez, L., Ramos, D., Serrano, J., Ortega, A., & Lleida, E. (2015). Albayzín-2014 evaluation: Audio segmentation and classification in broadcast news domains. EURASIP Journal on Audio, Speech, and Music Processing, 2015(1), 33. DOI
4	Galibert, O. (2013). Methodologies for the evaluation of speaker diarization and automatic speech recognition in the presence of overlapping speech. Proceedings of INTERSPEECH-2013 (pp. 1131-1134).
5	Gallardo-Antolin, A., & Hernandez, R. S. S. (2010). UPM-UC3M system for music and speech segmentation. Proceedings of VI Jornadas en Tecnología del Habla II Iberian SLTech Workshop (FALA) (pp. 421-424).
6	Gallardo-Antolin, A., & Montero, J. M. (2010). Histogram equalizationbased features for speech, music and song discrimination. IEEE Signal Processing Letters, 17(7), 659-662. DOI
7	Gupta, V., Kenny, P., Ouellet, P., & Stafylakis, T. (2014). I-vector based speaker adaptation of deep neural networks for French broadcast audio transcription. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 6334-6338).
8	Heittola, T., Mesaros, A., Virtanen, T., & Eronen, A. (2011). Sound event detection in multisource environments using source separation. Proceedings of Workshop Machine Listening in Multisource Environments (pp. 36-40).
9	Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82-97. DOI
10	Hyvarinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. Neural Networks, 10(3), 626-634. DOI
11	Hyvarinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5), 411-430. DOI
12	Justusson, B. I. (1981). Median filtering: Statistical properties. In T. S. Huang (Ed.), Two-Dimensional Digital Signal Processing II (pp. 161-196). Berlin: Springer.
13	Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., & Schwarz, P. (2011). The Kaldi speech recognition toolkit. Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
14	Lee, G. H. (2015). A study on the appropriateness of music broadcasting fee of terrestrial broadcasters. Music Content and Law, 203-250.
15	Mirsa, H., Ikbal, S., Bourlard, H., & Hermansky, H. (2004). Spectral entropy based feature for robust ASR. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 193-196).
16	Muller, M., & Ewert, S. (2011). Chroma toolbox: MATLAB implementations for extracting variants of chroma-based audio features. Proceedings of International Society for Music Information Retrieval Conference (ISMIR) (pp. 215-220).
17	Saon, G., Soltau, H., Nahamoo, D., & Picheny, M., (2013). Speaker adaptation of neural network acoustic models using i-vectors. Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 55-59).
18	SBS (2015). SBS drama special: Mask. Retrieved from http://programs.sbs.co.kr/drama/2015mask on September 27, 2018.
19	Snyder, D., Chen, G., & Povey, D. (2015). Musan: A music, speech, and noise corpus. Retrieved from https://arxiv.org/abs/1510.08484v1 on September 27, 2018.
20	Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
21	van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. The Journal of Machine Learning Research, 9(1), 2579-2605.
22	Wang, S. S., Lin, P., Lyu, D. C., Tsao, Y., Hwang, H. T., & Su, B. (2014). Acoustic feature conversion using a polynomial based feature transferring algorithm. Proceedings of International Symposium on Chinese Spoken Language Processing (ISCSLP) (pp. 454-458).
23	Metallinou, A., Lee, S., & Narayanan, S. (2008). Audio-visual emotion recognition using Gaussian mixture models for face and voice. Proceedings of International Symposium on Multimedia (ISM) (pp. 250-257).

KSCI

Performance of music section detection in broadcast drama contents using independent component analysis and deep neural networks ICA와 DNN을 이용한 방송 드라마 콘텐츠에서 음악구간 검출 성능

Performance of music section detection in broadcast drama contents using independent component analysis and deep neural networks