Acknowledgement
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2021R1A6A1A03045425).
References
- S. K. Kaya, T. Paksoy & J. A. Garza-Reyes. (2020). The New Challenge of Industry 4.0. Logistics 4.0: Digital Transformation of Supply Chain Management, 51.
- P. Aleksic, M. Ghodsi, A. Michaely, C. Allauzen, K. Hall, B. Roark & P. Moreno. (2015). Bringing contextual information to Google speech recognition.
- J. W. Ha, K. Nam, J. Kang, S. W. Lee, S. Yang, H. Jung & S. Kim. (2020). ClovaCall: Korean goal-oriented dialog speech corpus for automatic speech recognition of contact centers. arXiv preprint arXiv:2004.09367.
- D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel & K. Vesely. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
- M. N. Stuttle. (2003). A Gaussian mixture model spectral representation for speech recognition (Doctoral dissertation, University of Cambridge).
- M. Gales & S. Young. (2008). The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3), 195-304.
- A. Baevski, H. Zhou, A. Mohamed & M. Auli. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
- C. Wang, J. Pino & J. Gu. (2020). Improving cross-lingual transfer learning for end-to-end speech recognition with speech translation. arXiv preprint arXiv:2006.05474.
- Z. Q. Zhang, Y. Song, M. H. Wu, X. Fang & L. R. Dai. (2021). XLST: Cross-lingual Self-training to Learn Multilingual Representation for Low Resource Speech Recognition. arXiv preprint arXiv:2103.08207.
- C. Park, Y. Yang, K. Park & H. Lim. (2020). Decoding strategies for improving low-resource machine translation. Electronics, 9(10), 1562. https://doi.org/10.3390/electronics9101562
- C. Park, S. Eo, H. Moon & H. Lim. (2021, June). Should we find another model?: Improving Neural Machine Translation Performance with ONE-Piece Tokenization Method without Model Modification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers (pp. 97-104).
- K. Voll, S. Atkins & B. Forster. (2008). Improving the utility of speech recognition through error detection. Journal of Digital Imaging, 21(4), 371. https://doi.org/10.1007/s10278-007-9034-7
- A. Mani, S. Palaskar, N. V. Meripo, S. Konam & F. Metze. (2020, May). ASR error correction and domain adaptation using machine translation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6344-6348). IEEE.
- J. Liao, S. E. Eskimez, L. Lu, Y. Shi, M. Gong, L. Shou & M. Zeng. (2020). Improving readability for automatic speech recognition transcription. arXiv preprint arXiv:2004.04438.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez & I. Polosukhin. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
- C. Park, J. Seo, S. Lee, C. Lee, H. Moon, S. Eo & H. Lim. (2021). BTS: Back TranScription for Speech-to-Text Post-Processor using Text-to-Speech-to-Text. In Proceedings of the 8th Workshop on Asian Translation (pp. 106-116).
- M. Paulik, S. Rao, I. Lane, S. Vogel & T. Schultz. (2008, March). Sentence segmentation and punctuation recovery for spoken language translation. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5105-5108). IEEE.
- S. Skodova, M. Kucharova & L. Seps. (2012, September). Discretion of speech units for the text post-processing phase of automatic transcription (in the Czech language). In International Conference on Text, Speech and Dialogue (pp. 446-455). Springer, Berlin, Heidelberg.
- H. Cucu, A. Buzo, L. Besacier & C. Burileanu. (2013, July). Statistical error correction methods for domain-specific ASR systems. In International Conference on Statistical Language and Speech Processing (pp. 83-92). Springer, Berlin, Heidelberg.
- C. Park, K. Kim, Y. Yang, M. Kang & H. Lim. (2020). Neural spelling correction: translating incorrect sentences to correct sentences for multimedia. Multimedia Tools and Applications, 1-18.
- C. Park, Y. Yang, C. Lee & H. Lim. (2020). Comparison of the evaluation metrics for Neural Grammatical Error Correction with Overcorrection. IEEE Access, 8, 106264-106272. https://doi.org/10.1109/access.2020.2998149
- Z. Chi, S. Huang, L. Dong, S. Ma, S. Singhal, P. Bajaj & F. Wei. (2021). XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. arXiv preprint arXiv:2106.16138.
- L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant & C. Raffel. (2020). mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
- C. Lee & H. Kim. (2013). Automatic Korean word spacing using Pegasos algorithm. Information Processing & Management, 49(1), 370-379. https://doi.org/10.1016/j.ipm.2012.05.004
- J. Yi, J. Tao, Y. Bai, Z. Tian & C. Fan. (2020). Adversarial transfer learning for punctuation restoration. arXiv preprint arXiv:2004.00248.
- C. Park & H. Lim. (2020). A Study on the Performance Improvement of Machine Translation Using Public Korean-English Parallel Corpus. Journal of Digital Convergence, 18(6), 271-277. https://doi.org/10.14400/JDC.2020.18.6.271
- C. Park, Y. Lee, C. Lee & H. Lim. (2020). Quality, not quantity?: Effect of parallel corpus quantity and quality on neural machine translation. In The 32nd Annual Conference on Human Cognitive Language Technology (pp. 363-368).
- H. Moon, C. Park, S. Eo, J. Park & H. Lim. (2021). Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering. Journal of the Korea Convergence Society, 12(5), 1-7. https://doi.org/10.15207/JKCS.2021.12.5.001
- K. Papineni, S. Roukos, T. Ward & W. J. Zhu. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318).
- K. Sakaguchi, C. Napoles, M. Post & J. Tetreault. (2016). Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4, 169-182. https://doi.org/10.1162/tacl_a_00091
- T. Kudo & J. Richardson. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.