A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus (한영 병렬 코퍼스 구축을 위한 하이브리드 기반 문장 자동 정렬 방법)
- MALSORI, v.68, pp.95-114, 2008
The growing popularity of statistical methods in machine translation calls for much larger parallel corpora. Korean-English parallel corpora, however, are not yet sufficiently available, and little research is being conducted on the subject. In this paper we present a hybrid method for aligning sentences in Korean-English parallel corpora. To build a Korean-English parallel corpus, we use bilingual newswire web pages, reading comprehension materials for English learners, computer-related technical documents, and help files of localized software. Our hybrid method combines sentence-length-based and word-correspondence-based methods. We report and evaluate the experimental results. Alignment results obtained with a full translation model are very encouraging, especially when applied to an SMT system: the BLEU score improves by 0.66% and the NIST score by 9.94% compared to the previous method.
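The idea of combining a length-based signal with a word-correspondence signal can be illustrated with a minimal sketch. The abstract does not say how the two signals are combined, so everything below is an assumption made for illustration rather than the authors' method: the character-length ratio, the toy bilingual `lexicon`, the linear `weight`, the `skip_penalty`, and the restriction to 1-1 matches with skipped sentences.

```python
from typing import Dict, List, Optional, Set, Tuple


def length_similarity(src: str, tgt: str) -> float:
    """Length-based signal: ratio of shorter to longer character length, in [0, 1]."""
    a, b = len(src), len(tgt)
    if max(a, b) == 0:
        return 1.0
    return min(a, b) / max(a, b)


def lexical_similarity(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Word-correspondence signal: share of source tokens that have at least
    one translation from the (hypothetical) lexicon present in the target."""
    src_tokens = src.split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def hybrid_score(src: str, tgt: str, lexicon: Dict[str, Set[str]],
                 weight: float = 0.5) -> float:
    """Linear mix of the two signals; the weight is an illustrative assumption."""
    return (weight * length_similarity(src, tgt)
            + (1.0 - weight) * lexical_similarity(src, tgt, lexicon))


def align(src_sents: List[str], tgt_sents: List[str],
          lexicon: Dict[str, Set[str]],
          skip_penalty: float = 0.3) -> List[Tuple[int, int]]:
    """Dynamic programming over 1-1 matches and skipped sentences.
    Returns the matched (source index, target index) pairs in order."""
    n, m = len(src_sents), len(tgt_sents)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg_inf:
                continue
            moves = []
            if i < n and j < m:  # match source i with target j
                gain = hybrid_score(src_sents[i], tgt_sents[j], lexicon)
                moves.append((i + 1, j + 1, gain))
            if i < n:  # leave source i unaligned
                moves.append((i + 1, j, -skip_penalty))
            if j < m:  # leave target j unaligned
                moves.append((i, j + 1, -skip_penalty))
            for ni, nj, gain in moves:
                if best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[ni][nj] = (i, j)
    # Trace back from the end, keeping only the diagonal (match) steps.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```

Called with two Korean sentences, three English sentences (one of them without a counterpart), and a small lexicon, `align` returns index pairs linking each Korean sentence to its translation while leaving the extra English sentence unaligned. A practical aligner would also need 1-2 and 2-1 groupings and a statistically motivated length model.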
Machine translation refers to software that translates a source language into a target language; after rule-based and statistical machine translation, Neural Machine Translation is now being actively researched. One important factor in Neural Machine Translation is obtaining a high-quality parallel corpus, which has been hard to find for language pairs involving Korean. Recently, AI Hub of the National Information Society Agency (NIA) released a high-quality Korean-English parallel corpus of 1.6 million sentences. This paper attempts to verify the quality of each dataset by comparing the performance obtained with the data published by AI Hub against that obtained with OpenSubtitles, the most popular Korean-English parallel corpus. For objectivity, the test data is the test set published by IWSLT, an official test set for Korean-English machine translation. The experimental results show better performance than existing papers evaluated on the same test set, which demonstrates the importance of high-quality data.
This study uses a Korean-Chinese parallel corpus to contrast the Korean connective morpheme '-myenseo' with its corresponding Chinese expressions. Learners of Korean often struggle with Korean connective morphemes, especially when there is a lexical gap with their mother tongue. '-myenseo' is one of the most frequently used Korean connective morphemes, and it is usually contrasted with a Chinese coordinating conjunction. According to the corpus, however, the Chinese expressions corresponding to '-myenseo' are not limited to coordinating conjunctions. Because the parallel corpus reveals a variety of Chinese expressions related to '-myenseo', this study can help Chinese learners of Korean learn '-myenseo' more easily. The study first discusses the semantic features and syntactic characteristics of '-myenseo'. The significant semantic features of '-myenseo' are 'simultaneous' and 'conflict', and in this chapter the study uses usage examples to analyse the specific usage of '-myenseo'. It then analyses the syntactic characteristics of '-myenseo' in terms of subject constraints, predicate constraints, temporal constraints, mood constraints, and negation constraints, and summarizes them in a table. The most important part of this study is Chapter 4, which contrasts the Korean connective morpheme '-myenseo' with Chinese expressions by analysing the Korean-Chinese parallel corpus. As a result of the analysis, the frequency of the Chinese expressions contrasted with '-myenseo' is summarized into