Phraseological Analysis of Learner Corpus Based on Language Model

  • Published : 2018

Abstract

The present study examines how closely the English expressions produced by Korean native speakers resemble the common expressions used by native English speakers. To this end, this article provides a quantitative study of the Yonsei English Learner Corpus using methods from computational linguistics. The focus is on a language model of English texts written by Korean university students. A language model here refers to a collection of N-grams with their log probabilities, stored in the ARPA format; such a model serves to discriminate native-like sentences from awkward ones. The study compares a language model built from an L2 corpus with language models built from two L1 English corpora, namely English Gigaword and Europarl. Genia Sentence Splitter is used to segment the sentences, and SRILM is used to build the language models in a computationally tractable way. The analysis has two parts. The first is an in-depth analysis of N-grams, which consists of two subtasks. First, the N-grams are tallied and evaluated using common metrics of computational linguistics. Second, as an evaluation of the language models, the perplexity of each model is measured and compared against a reference point drawn from five test data sources. The second part is a linear regression analysis that detects patterns of overused and underused expressions in English texts written by Korean speakers.
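To make the notion of perplexity concrete, the following is a minimal sketch of how a bigram language model can score a native-like sentence against an awkward one. It is not the SRILM pipeline used in the study: the toy training sentences, the function names, and the add-one smoothing are all assumptions made for illustration, and real work would rely on large corpora and stronger smoothing. Base-10 logarithms are used because that is the convention of the ARPA format.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log10_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed log10 P(word | prev); an illustrative choice of smoothing."""
    return math.log10((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def perplexity(sentences, unigrams, bigrams):
    """Perplexity = 10 ** (-average log10 probability per predicted token)."""
    vocab_size = len(unigrams)  # simplification: boundary markers are counted too
    total_logprob, n_events = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            total_logprob += log10_prob(prev, word, unigrams, bigrams, vocab_size)
            n_events += 1
    return 10 ** (-total_logprob / n_events)

# Hypothetical toy data standing in for a reference (L1) corpus.
train = [
    ["i", "am", "a", "student"],
    ["i", "am", "a", "teacher"],
    ["she", "is", "a", "student"],
]
unigrams, bigrams = train_bigram(train)

ppl_native = perplexity([["i", "am", "a", "student"]], unigrams, bigrams)
ppl_awkward = perplexity([["student", "a", "am", "i"]], unigrams, bigrams)
print(ppl_native, ppl_awkward)
```

A lower perplexity means the model finds the word sequence less surprising, which is why the well-ordered sentence scores lower than the scrambled one under a model trained on the reference sentences.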

Keywords