Corpus-based evaluation of French text normalization

  • Received : 2018.08.08
  • Accepted : 2018.09.27
  • Published : 2018.09.30

Abstract

This paper presents a taxonomy of non-standard words (NSWs) for developing a French text normalization system and proposes a corpus-based method for evaluating such a system. The proposed taxonomy of French NSWs consists of 13 categories, including 2 letter-based categories and 9 number-based categories. To evaluate the text normalization system, a representative test set is constructed that includes NSWs from various text domains, such as news, literature, non-fiction, social-networking services (SNSs), and transcriptions, and an evaluation equation is proposed that reflects the distribution of NSW categories in the target domain to which the system is applied. The error rate on the test set is 1.64%, while the error rate on the whole corpus is 2.08% once the NSW distribution in the corpus is taken into account. The results show that the literature and SNS domains have higher error rates than the test set.
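The abstract does not give the evaluation equation itself, so the following is only a minimal sketch of one plausible reading: per-category error rates measured on the test set are re-weighted by the share of each NSW category in a target domain. The function name, the category labels, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict


def weighted_error_rate(category_error: Dict[str, float],
                        category_share: Dict[str, float]) -> float:
    """Re-weight per-category error rates by a domain's NSW category shares.

    category_error: error rate per NSW category, measured on the test set.
    category_share: counts or proportions of each NSW category in the
                    target domain (normalized here, so raw counts work too).
    Assumption: the domain error is the share-weighted mean of the
    per-category errors; the paper's actual equation may differ.
    """
    total = sum(category_share.values())
    if total <= 0:
        raise ValueError("category_share must have a positive total")
    # A missing category in category_error raises KeyError on purpose:
    # silently treating it as error-free would bias the estimate downward.
    return sum(category_error[cat] * share / total
               for cat, share in category_share.items())


if __name__ == "__main__":
    # Hypothetical per-category error rates and domain distributions.
    errors = {"abbreviation": 0.02, "cardinal": 0.01, "date": 0.04}
    test_set = {"abbreviation": 100, "cardinal": 100, "date": 100}
    sns_like = {"abbreviation": 600, "cardinal": 300, "date": 100}
    print(f"test set: {weighted_error_rate(errors, test_set):.4f}")
    print(f"SNS-like domain: {weighted_error_rate(errors, sns_like):.4f}")
```

Under this reading, the same system can score differently on two domains even though its per-category behavior is unchanged, which is consistent with the abstract reporting different error rates for the test set and the whole corpus.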
