[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.13064/KSSS.2019.11.4.089

Speech detection from broadcast contents using multi-scale time-dilated convolutional neural networks

Jang, Byeong-Yong (School of Electronics Engineering, Chungbuk National University)
Kwon, Oh-Wook (School of Electronics Engineering, Chungbuk National University)

Publication Information

Phonetics and Speech Sciences / v.11, no.4, 2019, pp. 89-96

Abstract

In this paper, we propose a deep learning architecture that effectively detects speech segments in broadcast content. We also propose a multi-scale time-dilated layer for learning the temporal changes of feature vectors. To verify the performance of the proposed model, we implement several comparison models and calculate the frame-by-frame F-score, precision, and recall. The proposed model and the comparison models are all trained on the same data: 32 hours of Korean broadcast data covering various genres (drama, news, documentary, and so on). The proposed model shows the best performance on the Korean broadcast data, with an F-score of 91.7%. It also achieves the highest F-scores on British and Spanish broadcast data, 87.9% and 92.6%, respectively. These results indicate that learning the temporal changes of the feature vectors contributes to improved speech detection performance.
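The two technical ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no kernel sizes, dilation rates, or network layout, so the function names (`dilated_conv1d`, `multi_scale_time_dilated`, `frame_scores`), the kernel, and the dilation rates `(1, 2, 4)` are all assumptions chosen for illustration. The first two functions apply one kernel along the time axis at several dilation rates; the third computes frame-by-frame precision, recall, and F-score for binary speech labels.

```python
from typing import List, Sequence, Tuple


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Apply a kernel along time with taps spaced `dilation` frames apart.

    Uses the unflipped (cross-correlation) form common in CNNs and zero
    padding, so the output has the same length as the input.
    """
    center = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - center) * dilation
            if 0 <= idx < len(x):  # frames outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def multi_scale_time_dilated(x: Sequence[float], kernel: Sequence[float],
                             dilations: Sequence[int] = (1, 2, 4)
                             ) -> List[List[float]]:
    """Return one output channel per dilation rate (rates are hypothetical)."""
    return [dilated_conv1d(x, kernel, d) for d in dilations]


def frame_scores(reference: Sequence[int],
                 predicted: Sequence[int]) -> Tuple[float, float, float]:
    """Frame-by-frame precision, recall, and F-score (1 = speech frame)."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    track = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # toy per-frame feature
    for rate, channel in zip((1, 2, 4),
                             multi_scale_time_dilated(track, [1.0, 1.0, 1.0])):
        print(rate, channel)
    print(frame_scores([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

With a three-tap kernel, a dilation rate of `d` spans `2d + 1` frames, so the larger rates see longer temporal context at the same parameter count. In a trained network the kernel weights would be learned and the per-rate channels combined by later layers.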

Keywords

speech detection; multi-scale time-dilated convolution; deep learning; broadcast data

