Improved Statistical Language Model for Context-sensitive Spelling Error Candidates (문맥의존 철자오류 후보 생성을 위한 통계적 언어모형 개선)

  • Lee, Jung-Hun;Kim, Minho;Kwon, Hyuk-Chul
    • Journal of Korea Multimedia Society
    • v.20 no.2
    • pp.371-381
    • 2017
  • The performance of statistical context-sensitive spelling error correction depends on the quality and quantity of the data used to build the statistical language model. In general, the quality of a statistical language model grows in proportion to the size of its data. However, as the amount of data increases, processing becomes slower and more storage space is required. To address this problem, we propose an improved statistical language model, together with an effective spelling error candidate generation method based on it. The proposed model and the correction method built on it improve both the performance and the processing speed of spelling error correction.

A Statistical Model for Choosing the Best Translation of Prepositions (통계 정보를 이용한 전치사 최적 번역어 결정 모델)

  • 심광섭
    • Language and Information
    • v.8 no.1
    • pp.101-116
    • 2004
  • This paper proposes a statistical model for the translation of prepositions in English-Korean machine translation. In the proposed model, statistical information acquired from unlabeled Korean corpora is used to choose the best translation from several possible translations. Such information includes functional word-verb co-occurrence information, functional word-verb distance information, and noun-postposition co-occurrence information. The model was evaluated with 443 sentences, each of which has a prepositional phrase, and we attained 71.3% accuracy.

  • Tierney, Luke
    • Journal of the Korean Statistical Society
    • v.36 no.1
    • pp.31-55
    • 2007
  • R is an open source language for statistical computing and graphics based on the ACM software award-winning S language. R is widely used for data analysis and has become a major vehicle for making available new statistical methodology. This paper presents an overview of the design philosophy and the development model for R, reviews the basic capabilities of the system, and outlines some current projects that will influence future developments of R.

Language Modeling Approaches to Information Retrieval

  • Banerjee, Protima;Han, Hyo-Il
    • Journal of Computing Science and Engineering
    • v.3 no.3
    • pp.143-164
    • 2009
  • This article surveys recent research in the area of language modeling (sometimes called statistical language modeling) approaches to information retrieval. Language modeling is a formal probabilistic retrieval framework with roots in speech recognition and natural language processing. The underlying assumption of language modeling is that human language generation is a random process; the goal is to model that process via a generative statistical model. In this article, we discuss current research in the application of language modeling to information retrieval, the role of semantics in the language modeling framework, cluster-based language models, use of language modeling for XML retrieval and future trends.
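
As an illustration of the generative view described in this abstract, the following is a minimal sketch of query-likelihood ranking: each document is treated as a unigram language model, and documents are ranked by the probability that their model generates the query. Jelinek-Mercer smoothing is chosen here as one common option; the function names, the toy documents, and the value lam=0.5 are assumptions made for this example and are not taken from the surveyed article.

```python
import math
from collections import Counter


def score_query(query, doc, collection_counts, collection_len, lam=0.5):
    """Return log P(query | doc) under a smoothed unigram model.

    Jelinek-Mercer smoothing mixes the document model with the
    collection model: p = lam * p_doc + (1 - lam) * p_collection.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    log_prob = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts[term] / collection_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # The term never occurs anywhere in the collection.
            return float("-inf")
        log_prob += math.log(p)
    return log_prob


def rank(query, docs, lam=0.5):
    """Return document indices ordered from most to least likely."""
    collection_counts = Counter(term for doc in docs for term in doc)
    collection_len = sum(collection_counts.values())
    scored = [
        (score_query(query, doc, collection_counts, collection_len, lam), i)
        for i, doc in enumerate(docs)
    ]
    # Higher log-probability first; ties broken by document index.
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical toy collection, already tokenized.
docs = [
    ["language", "model", "retrieval"],
    ["speech", "recognition", "model"],
    ["xml", "retrieval", "cluster"],
]
ranking = rank(["language", "retrieval"], docs)
```

In this toy collection the first document contains both query terms and is ranked first. Smoothing keeps the other documents from receiving zero probability merely because one query term is missing from them.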

A statistical journey to DNN, the third trip: Language model and transformer (심층신경망으로 가는 통계 여행, 세 번째 여행: 언어모형과 트랜스포머)

  • Yu Jin Kim;In Jun Hwang;Kisuk Jang;Yoon Dong Lee
    • The Korean Journal of Applied Statistics
    • v.37 no.5
    • pp.567-582
    • 2024
  • Over the past decade, the remarkable advancements in deep neural networks have paralleled the development and evolution of language models. Initially, language models were developed in the form of Encoder-Decoder models using early RNNs. However, with the introduction of Attention in 2015 and the emergence of the Transformer in 2017, the field saw revolutionary growth. This study briefly reviews the development process of language models and examines in detail the working mechanism and technical elements of the Transformer. Additionally, it explores statistical models and methodologies related to language models and the Transformer.
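
The core operation of the Transformer mentioned in this abstract is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch below implements that formula with plain Python lists so that each step is visible; it is an illustrative sketch only, not code from the reviewed paper, and the function names and input layout (lists of row vectors) are assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of row vectors.

    For each query, compute its dot product with every key, scale by
    sqrt(d_k), normalize with softmax, and return the weighted sum of
    the value vectors.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

When all keys are identical, every key receives the same weight and the output is simply the average of the value vectors, which is a convenient sanity check for an implementation.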

Statistical Generation of Korean Chatting Sentences Using Multiple Feature Information (복합 자질 정보를 이용한 통계적 한국어 채팅 문장 생성)

  • Kim, Jong-Hwan;Chang, Du-Seong;Kim, Hark-Soo
    • Korean Journal of Cognitive Science
    • v.20 no.4
    • pp.421-437
    • 2009
  • A chatting system is a computer program that simulates conversations between a human and a computer using natural language. In this paper, we propose a statistical model that generates natural chatting sentences when keywords and speech acts are given as input. The proposed model first finds Eojeols (Korean spacing units) that include the input keywords in a corpus, and generates sentence candidates by using appearance information and syntactic information of the Eojeols surrounding the found Eojeols. Then, the proposed model selects one of the sentence candidates by using a language model based on speech act information, co-occurrence information between Eojeols, and syntactic information of each Eojeol. In the experiment, the proposed model achieved a correct sentence generation rate of 86.2%, which is higher than that of a previous conventional model based on a simple language model.

Statistical Analysis Between Size and Balance of Text Corpus by Evaluation of the effect of Interview Sentence in Language Modeling (언어모델 인터뷰 영향 평가를 통한 텍스트 균형 및 사이즈간의 통계 분석)

  • Jung Eui-Jung;Lee Youngjik
    • Proceedings of the Acoustical Society of Korea Conference
    • spring
    • pp.87-90
    • 2002
  • This paper statistically analyzes the relationship between the size and balance of a text corpus by evaluating the effect of interview sentences on the language model of a Korean broadcast news transcription system. The ultimate purpose of the system is to recognize the speech of anchors and reporters in broadcast news shows, not interview speech. However, interview sentences make up approximately 15% of the text corpus gathered for constructing the language model. Interview sentences differ in character from anchor and reporter sentences, and therefore disturb anchor- and reporter-oriented language modeling. We evaluate the effect of interview sentences on the language model and statistically analyze the relationship between corpus size and balance by repeating the same experimental procedure while varying the size of the corpus.

Automatic Generation of Multiple-Choice Questions Based on Statistical Language Model (통계 언어모델 기반 객관식 빈칸 채우기 문제 생성)

  • Park, Youngki
    • Journal of The Korean Association of Information Education
    • v.20 no.2
    • pp.197-206
    • 2016
  • Fill-in-the-blank questions with choices are widely used in classrooms to check whether students understand what is being taught. Although many algorithms have been proposed for generating this type of question, most of them focus on preparing sentences with blanks rather than generating the multiple choices. In this paper, we propose a novel algorithm for generating multiple choices, given a sentence with a blank. Because the algorithm is based on a statistical language model, it can generate relatively unbiased results and adjust the level of difficulty with ease. The experimental results show that our approach automatically produces multiple choices similar to those written by exam writers.

The Statistical Relationship between Linguistic Items and Corpus Size (코퍼스 빈도 정보 활용을 위한 적정 통계 모형 연구: 코퍼스 규모에 따른 타입/토큰의 함수관계 중심으로)

  • 양경숙;박병선
    • Language and Information
    • v.7 no.2
    • pp.103-115
    • 2003
  • In recent years, many organizations have been constructing their own large corpora to achieve corpus representativeness. However, there is no reliable guideline as to how large corpus resources should be, especially for Korean corpora. In this study, we propose a new statistical model, ARIMA (Autoregressive Integrated Moving Average), for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous studies on this issue. Finally, we illustrate that the presented ARIMA model is valid, accurate and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness and linguistic comprehensiveness.
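
The type/token relationship discussed in this abstract can be made concrete with a short sketch that records how many distinct word forms (types) have been seen after every fixed number of running words (tokens). This only produces the data points that a predictive model would be fitted to; it does not implement the ARIMA model of the paper. The function name, the step parameter, and the whitespace-tokenized sample sentence are assumptions made for this example.

```python
def type_token_curve(tokens, step):
    """Return (token_count, type_count) pairs sampled every `step` tokens.

    The final position is always included, so the last pair describes
    the whole corpus.
    """
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(seen)))
    return curve


# Hypothetical toy corpus; a real study would use a large tokenized corpus.
sample = "the cat sat on the mat the end".split()
points = type_token_curve(sample, step=4)
```

For the sample sentence, four types are seen after the first four tokens and six after all eight, which shows the typical pattern of type growth slowing as the corpus grows.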

