[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.4218/etrij.10.1510.0092

Three-Stage Framework for Unsupervised Acoustic Modeling Using Untranscribed Spoken Content

Zgank, Andrej (Faculty of Electrical Engineering and Computer Science, University of Maribor)

Publication Information

ETRI Journal / v.32, no.5, 2010, pp. 810-818

Abstract

This paper presents a new framework for integrating untranscribed spoken content into the acoustic training of an automatic speech recognition system. Untranscribed spoken content is especially important for under-resourced languages because producing manually transcribed speech databases remains expensive and time-consuming. We propose two new methods as part of the training framework. The first method combines initial acoustic models using a data-driven metric. The second method is an improved acoustic training procedure based on unsupervised transcriptions, in which word endings are modified by broad phonetic classes. The training framework was applied to baseline acoustic models using untranscribed spoken content from parliamentary debates. The evaluation includes three types of acoustic models: baseline, reference content, and framework content models. The best overall result, a word error rate of 18.02%, was achieved with the third type, the framework content models. This result is a statistically significant improvement over the baseline and reference acoustic models.
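The abstract reports its results as word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), the following may help; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration and are not taken from the paper, which does not describe its scoring tool here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    which is an assumption, not the paper's scoring setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For instance, scoring the hypothesis "the cat sat on mat" against the reference "the cat sat on the mat" counts one deletion over six reference words, giving a WER of about 16.7%.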

Keywords

Automatic speech recognition; acoustic modeling; untranscribed spoken content; data-driven metric; imperfect transcriptions

Citations & Related Records

Times Cited By KSCI: 4
Times Cited By Web Of Science: 2
Times Cited By SCOPUS: 2
