DNN-based acoustic modeling for speech recognition of native and foreign speakers

  • Received : 2017.05.05
  • Accepted : 2017.06.13
  • Published : 2017.06.30

Abstract

This paper proposes a new method for training Deep Neural Network (DNN)-based acoustic models for speech recognition of native and foreign speakers. The proposed method consists of determining multi-set state clusters with different acoustic properties, training a DNN-based acoustic model on them, and recognizing speech with that model. In the proposed architecture, the hidden nodes of the DNN are shared across speaker groups, while the output nodes are separated so that the distinct acoustic properties of native and foreign speech are modeled independently. On an English speech recognition task with native Korean and native English speakers, the proposed method slightly improves recognition accuracy over the conventional multi-condition training method.
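The shared-hidden / separated-output structure described above can be sketched as a simple forward pass. This is a minimal illustration only: the layer sizes, number of hidden layers, ReLU activations, and senone counts below are assumed for the example (the abstract does not specify the paper's actual topology), and the two output heads stand in for the native and foreign multi-set state clusters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: a stacked-frame input,
# two shared hidden layers, and one senone set per speaker condition.
DIM_IN, DIM_HID = 440, 1024
N_SENONES_NATIVE, N_SENONES_FOREIGN = 3000, 3000

# Hidden-layer weights shared by both native and foreign speech.
W1 = rng.standard_normal((DIM_IN, DIM_HID)) * 0.01
W2 = rng.standard_normal((DIM_HID, DIM_HID)) * 0.01

# Separate output layers, one per acoustic condition.
W_out = {
    "native":  rng.standard_normal((DIM_HID, N_SENONES_NATIVE)) * 0.01,
    "foreign": rng.standard_normal((DIM_HID, N_SENONES_FOREIGN)) * 0.01,
}

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, condition):
    """Shared hidden layers, then a condition-specific softmax output."""
    h = np.maximum(0.0, x @ W1)   # shared hidden layer 1 (ReLU)
    h = np.maximum(0.0, h @ W2)   # shared hidden layer 2 (ReLU)
    return softmax(h @ W_out[condition])

# One acoustic feature vector routed through each output head.
x = rng.standard_normal((1, DIM_IN))
p_native = forward(x, "native")
p_foreign = forward(x, "foreign")
```

During training, frames from both speaker groups would update the shared hidden weights, while each group's frames update only its own output layer, which is what distinguishes this scheme from pooling all data under a single output layer as in multi-condition training.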
