Performance of music section detection in broadcast drama contents using independent component analysis and deep neural networks
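Heo, Woon-Haeng (Control and Robot Engineering Major, Graduate School, Chungbuk National University)
Jang, Byeong-Yong (Control and Robot Engineering Major, Graduate School, Chungbuk National University)
Jo, Hyeon-Ho (Control and Robot Engineering Major, Graduate School, Chungbuk National University)
Kim, Jung-Hyun (Electronics and Telecommunications Research Institute)
Kwon, Oh-Wook (Chungbuk National University)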
1 | Aguilo, M., Butko, T., Temko, A., & Nadeu, C. (2009). A hierarchical architecture for audio segmentation in a broadcast news task. Proceedings of I Iberian SLTech (pp. 17-20). |
2 | Boersma, P., & Weenink, D. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9-10), 341-345. |
3 | Castan, D., Tavarez, D., Lopez-Otero, P., Franco-Pedroso, J., Delgado, H., Navas, E., Docio-Fernandez, L., Ramos, D., Serrano, J., Ortega, A., & Lleida, E. (2015). Albayzín-2014 evaluation: Audio segmentation and classification in broadcast news domains. EURASIP Journal on Audio, Speech, and Music Processing, 2015(1), 33. |
4 | Galibert, O. (2013). Methodologies for the evaluation of speaker diarization and automatic speech recognition in the presence of overlapping speech. Proceedings of INTERSPEECH-2013 (pp. 1131-1134). |
5 | Gallardo-Antolin, A., & Hernandez, R. S. S. (2010). UPM-UC3M system for music and speech segmentation. Proceedings of VI Jornadas en Tecnología del Habla II Iberian SLTech Workshop (FALA) (pp. 421-424). |
6 | Gallardo-Antolin, A., & Montero, J. M. (2010). Histogram equalization-based features for speech, music and song discrimination. IEEE Signal Processing Letters, 17(7), 659-662. |
7 | Gupta, V., Kenny, P., Ouellet, P., & Stafylakis, T. (2014). I-vector based speaker adaptation of deep neural networks for French broadcast audio transcription. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 6334-6338). |
8 | Heittola, T., Mesaros, A., Virtanen, T., & Eronen, A. (2011). Sound event detection in multisource environments using source separation. Proceedings of Workshop Machine Listening in Multisource Environments (pp. 36-40). |
9 | Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82-97. |
10 | Hyvarinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626-634. |
11 | Hyvarinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5), 411-430. |
12 | Justusson, B. I. (1981). Median filtering: Statistical properties. In T. S. Huang (Ed.), Two-Dimensional Digital Signal Processing II (pp. 161-196). Berlin: Springer. |
13 | Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., & Schwarz, P. (2011). The Kaldi speech recognition toolkit. Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). |
14 | Lee, G. H. (2015). A study on the appropriateness of music broadcasting fee of terrestrial broadcasters. Music Content and Law, 203-250. |
15 | Misra, H., Ikbal, S., Bourlard, H., & Hermansky, H. (2004). Spectral entropy based feature for robust ASR. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 193-196). |
16 | Muller, M., & Ewert, S. (2011). Chroma toolbox: MATLAB implementations for extracting variants of chroma-based audio features. Proceedings of International Society for Music Information Retrieval Conference (ISMIR) (pp. 215-220). |
17 | Saon, G., Soltau, H., Nahamoo, D., & Picheny, M. (2013). Speaker adaptation of neural network acoustic models using i-vectors. Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 55-59). |
18 | SBS (2015). SBS drama special: Mask. Retrieved from http://programs.sbs.co.kr/drama/2015mask on September 27, 2018. |
19 | Snyder, D., Chen, G., & Povey, D. (2015). Musan: A music, speech, and noise corpus. Retrieved from https://arxiv.org/abs/1510.08484v1 on September 27, 2018. |
20 | Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958. |
21 | van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. The Journal of Machine Learning Research, 9(1), 2579-2605. |
22 | Wang, S. S., Lin, P., Lyu, D. C., Tsao, Y., Hwang, H. T., & Su, B. (2014). Acoustic feature conversion using a polynomial based feature transferring algorithm. Proceedings of International Symposium on Chinese Spoken Language Processing (ISCSLP) (pp. 454-458). |
23 | Metallinou, A., Lee, S., & Narayanan, S. (2008). Audio-visual emotion recognition using Gaussian mixture models for face and voice. Proceedings of International Symposium on Multimedia (ISM) (pp. 250-257). |