Three-Stage Framework for Unsupervised Acoustic Modeling Using Untranscribed Spoken Content

Zgank, Andrej;

doi:10.4218/etrij.10.1510.0092

ETRI Journal

Volume 32 Issue 5
/
Pages.810-818
/
2010
/
1225-6463(pISSN)
/
2233-7326(eISSN)

Electronics and Telecommunications Research Institute (한국전자통신연구원)

DOI QR Code

Three-Stage Framework for Unsupervised Acoustic Modeling Using Untranscribed Spoken Content

Zgank, Andrej (Faculty of Electrical Engineering and Computer Science, University of Maribor)

투고 : 2010.03.14
심사 : 2010.06.28
발행 : 2010.10.31

https://doi.org/10.4218/etrij.10.1510.0092 인용 PDF KSCI

Download PDF

⟨ Previous Next ⟩

초록

This paper presents a new framework for integrating untranscribed spoken content into the acoustic training of an automatic speech recognition system. Untranscribed spoken content plays a very important role for under-resourced languages because the production of manually transcribed speech databases still represents a very expensive and time-consuming task. We proposed two new methods as part of the training framework. The first method focuses on combining initial acoustic models using a data-driven metric. The second method proposes an improved acoustic training procedure based on unsupervised transcriptions, in which word endings were modified by broad phonetic classes. The training framework was applied to baseline acoustic models using untranscribed spoken content from parliamentary debates. We include three types of acoustic models in the evaluation: baseline, reference content, and framework content models. The best overall result of 18.02% word error rate was achieved with the third type. This result demonstrates statistically significant improvement over the baseline and reference acoustic models.

키워드

참고문헌

C.H. Lee, "On Automatic Speech Recognition at the Dawn of the 21st Century," IEICE Trans. Inf. Syst., vol. E86-D, no. 3, Mar. 2003, pp. 377-396.
H.Y. Jung, B.O. Kang, and Y. Lee, "Model Adaptation Using Discriminative Noise Adaptive Training Approach for New Environments," ETRI J., vol. 30, no. 6, Dec. 2008, pp. 865-867. https://doi.org/10.4218/etrij.08.0208.0256
J. Na, W. Choi, and D. Lee, "Design and Implementation of a Multimodal Input Device Using a Web Camera," ETRI J., vol. 30, no. 4, Aug. 2008, pp. 621-623. https://doi.org/10.4218/etrij.08.0208.0018
S. Kim, M. Ji, and H. Kim, "Noise-Robust Speaker Recognition Using Subband Likelihoods and Reliable-Feature Selection," ETRI J., vol. 30, no.1, Feb. 2008, pp. 89-100. https://doi.org/10.4218/etrij.08.0107.0108
T. Cincarek et al., "Development, Long-Term Operation and Portability of a Real-Environment Speech-Oriented Guidance System," IEICE Trans. Inf. Syst., vol. E91-D, no. 3, 2008, pp. 576-587. https://doi.org/10.1093/ietisy/e91-d.3.576
L. Lamel, J.L. Gauvain, and G. Adda, "Lightly Supervised and Unsupervised Acoustic Model Training," Computer Speech & Language, vol. 16, no. 1, 2002, pp. 115-129. https://doi.org/10.1006/csla.2001.0186
T. Cincarek et al., "Cost Reduction of Acoustic Modeling for Real-Environment Applications Using Unsupervised and Selective Training," IEICE Trans. Inf. Syst., vol. E91-D, no. 3, 2008, pp. 499-507. https://doi.org/10.1093/ietisy/e91-d.3.499
S. Novotney, R. Schwartz, and J. Ma, "Unsupervised Acoustic and Language Model Training with Small Amounts of Labelled Data," Proc. 2009 IEEE Int. Conf. Acoustics, Speech Signal Process., Apr. 19-24, 2009, pp. 4297-4300.
B. Chen, J.W. Kuo, and W.H. Tsai, "Lightly Supervised and Data-Driven Approaches to Mandarin Broadcast News Transcription," ICASSP, 2004, pp. 777-780.
J. Ma and R. Schwartz, "Unsupervised Versus Supervised Training of Acoustic Models," INTERSPEECH, 2008, pp. 2374- 2377.
F. Wessel and H. Ney, "Unsupervised Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition," ASRU Workshop, 2001, pp. 307-310.
P.J. Jang and A.G. Hauptmann, "Improving Acoustic Models with Captioned Multimedia Speech," IEEE Int. Conf. Multimedia Computing Syst., Florence, Italy, 1999, pp. 767-771.
B. Lecouteux et al., "Imperfect Transcript Driven Speech Recognition," Interspeech-ICSLP, Pittsburgh, PA, 2006, pp. 1626-1629.
A. Lambourne et al., "Speech-Based Real-Time Subtitling Services," Int. J. Speech Technol., vol. 7, no. 4, 2004, pp. 269-279. https://doi.org/10.1023/B:IJST.0000037071.39044.cc
J. Brousseau et al., "Automatic Closed-Caption of Live TV Broadcast News in French," Proc. Eurospeech, Geneva, Switzerland, Sept. 2003, pp. 1245-1248.
Z. Kacic, "Importance of Merging the Research Potentials for Surpassing the Language Barriers in the Frame of Next Generation Speech Technologies," Proc. Inf. Soc. Multi-Conf., Ljubljana, Slovenia, Oct. 2002, pp. 111-115.
M.S. Maucec, Z. Kacic, and B. Horvat, "Modelling Highly Inflected Languages," Inf. Sciences, vol. 166, no. 1, Oct. 2004, pp. 249-269. https://doi.org/10.1016/j.ins.2003.12.004
A. Zgank, Z. Kacic, and B. Horvat, "Large Vocabulary Continuous Speech Recognizer for Slovenian Language," Lecture Notes Computer Science, Springer Verlag, 2001, pp. 242-248.
S. Furui et al., "Analysis and Recognition of Spontaneous Speech Using Corpus of Spontaneous Japanese," Speech Commun., vol. 47, no. 1-2, Sept. 2005, pp. 208-219. https://doi.org/10.1016/j.specom.2005.02.010
F. Stouten et al., "Coping with Disfluencies in Spontaneous Speech Recognition: Acoustic Detection and Linguistic Context Manipulation," Speech Commun., vol. 48, no. 11, 2006, pp. 1590-1606. https://doi.org/10.1016/j.specom.2006.04.004
K.N. Lee and M. Chung, "Morpheme-Based Modeling of Pronunciation Variation for Large Vocabulary Continuous Speech Recognition in Korean," IEICE Trans. Inf. Syst., vol. E90-D, no. 7, July 2007, pp. 1063-1072. https://doi.org/10.1093/ietisy/e90-d.7.1063
A. Zgank, B. Horvat, and Z. Kacic, "Data-Driven Generation of Phonetic Broad Classes Based on Phoneme Confusion Matrix Similarity," Speech Commun., vol. 47, no. 3, 2005, pp. 379-393. https://doi.org/10.1016/j.specom.2005.03.011
A. Zgank et al., "BNSI Slovenian Broadcast News Database: Speech and Text Corpus," 9th European Conf. Speech Commun. Technol., Interspeech Lisboa, Lisbon, Portugal, Sept. 4-8, 2005.
C. Barras et al., "Transcriber: Development and Use of a Tool for Assisting Speech Corpora Production," Speech Commun., vol. 33, no.1-2, 2001, pp. 5-22. https://doi.org/10.1016/S0167-6393(00)00067-4
A. Zgank et al, "SloParl: Slovenian Parliamentary Speech and Text Corpus for Large Vocabulary Continuous Speech Recognition," Proc. INTERSPEECH, ICSLP, Pittsburgh, PA, 2006, pp. 197-200.
H. Heuvel et al., "Annotation in the SpeechDat Projects," Int. J. Speech Technology, vol. 4, no. 2, 2001, pp. 127-143. https://doi.org/10.1023/A:1011375311203
D. Kim and D. Yook, "A Closed-Form Solution of Linear Spectral Transformation for Robust Speech Recognition," ETRI J., vol. 31, no. 4, Aug. 2009, pp. 454-456. https://doi.org/10.4218/etrij.09.0209.0012
A. Žgank et al., "The COST 278 MASPER Initiative: Crosslingual Speech Recognition with Large Telephone Databases," Proc. LREC, Lisbon, Portugal, May 2004, pp. 2107- 2110.
F.T. Johansen et al., "The COST 249 SpeechDat Multilingual Reference Recogniser," Proc. LREC, Athens, Greece, May 2000, pp. 1351-1355.

피인용 문헌

Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS vol.47, pp.4, 2013, https://doi.org/10.1007/s10579-013-9216-5