DNN based Speech Detection for the Media Audio

Jang, Inseon; Ahn, ChungHyun; Seo, Jeongil; Jang, Younseon

doi:10.5909/JBE.2017.22.5.632

Journal of Broadcast Engineering (방송공학회논문지)

Volume 22 Issue 5 / Pages 632-642 / 2017 / 1226-7953 (pISSN) / 2287-9137 (eISSN)

The Korean Institute of Broadcast and Media Engineers (한국방송∙미디어공학회)

Jang, Inseon (Tera-Media Research Group, Media Research Division, Broadcasting and Media Research Laboratory, ETRI) ;
Ahn, ChungHyun (Tera-Media Research Group, Media Research Division, Broadcasting and Media Research Laboratory, ETRI) ;
Seo, Jeongil (Tera-Media Research Group, Media Research Division, Broadcasting and Media Research Laboratory, ETRI) ;
Jang, Younseon (Dept. of Electronic Engineering, Chungnam National University)

Received : 2017.07.11
Accepted : 2017.07.27
Published : 2017.09.30

https://doi.org/10.5909/JBE.2017.22.5.632

Abstract

In this paper, we propose a DNN-based speech detection system that uses the acoustic characteristics and context information of media audio. Speech detection, which discriminates between the speech and non-speech contained in media audio, is a necessary preprocessing step for effective speech processing. However, because media audio mixes many types of sound sources, conventional signal processing techniques have struggled to achieve high performance. The proposed method improves speech detection performance by separating the harmonic and percussive components of the media audio and constructing a DNN input vector that reflects both the acoustic characteristics and the context information of the audio. To verify the performance of the proposed system, we built a speech detection data set from more than 20 hours of drama and additionally obtained a publicly available 8-hour Hollywood movie data set for the experiments. Cross-validation on the two data sets shows that the proposed system outperforms the conventional method.
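The two ideas named in the abstract, harmonic/percussive separation and a context-aware input vector, can be illustrated with a minimal sketch. The abstract does not state which separation algorithm, features, or context width the system uses, so everything below is an assumption: the separation is done by median filtering a magnitude spectrogram with binary masks, the window widths are arbitrary, and the function names `median_filter_1d`, `hpss_masks`, and `stack_context` are hypothetical.

```python
from statistics import median


def median_filter_1d(values, width):
    """Median-filter a sequence with an odd window, clamping at the edges."""
    half = width // 2
    last = len(values) - 1
    return [
        median(values[min(max(i + k, 0), last)] for k in range(-half, half + 1))
        for i in range(len(values))
    ]


def hpss_masks(spec, width=5):
    """Split a magnitude spectrogram spec[frame][bin] into harmonic and
    percussive parts using binary masks (assumed method, not the paper's)."""
    n_frames, n_bins = len(spec), len(spec[0])
    # Harmonic energy is smooth along time: filter each bin across frames.
    cols = [median_filter_1d([spec[t][f] for t in range(n_frames)], width)
            for f in range(n_bins)]
    # Percussive energy is smooth along frequency: filter each frame across bins.
    rows = [median_filter_1d(spec[t], width) for t in range(n_frames)]
    # A cell is kept as harmonic when the time-direction median dominates.
    harmonic = [[spec[t][f] if cols[f][t] >= rows[t][f] else 0.0
                 for f in range(n_bins)] for t in range(n_frames)]
    # Whatever is not assigned to the harmonic part is treated as percussive.
    percussive = [[spec[t][f] - harmonic[t][f] for f in range(n_bins)]
                  for t in range(n_frames)]
    return harmonic, percussive


def stack_context(frames, context=2):
    """Concatenate each frame with `context` neighbours on both sides
    (edge frames are repeated), giving one input vector per frame."""
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vec = []
        for k in range(-context, context + 1):
            vec.extend(frames[min(max(t + k, 0), last)])
        stacked.append(vec)
    return stacked
```

In a pipeline of this kind, per-frame features computed from the harmonic and percussive parts would be concatenated and then passed through `stack_context`, so that each DNN input vector also carries information from the surrounding frames.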
