• Title/Summary/Keyword: SMS 말뭉치

Search Result 4, Processing Time 0.017 seconds

Analyzing Vocabulary Characteristics of Colloquial Style Corpus and Automatic Construction of Sentiment Lexicon (구어체 말뭉치의 어휘 사용 특징 분석 및 감정 어휘 사전의 자동 구축)

  • Kang, Seung-Shik;Won, HyeJin;Lee, Minhaeng
    • Smart Media Journal
    • /
    • v.9 no.4
    • /
    • pp.144-151
    • /
    • 2020
  • In a mobile environment, communication takes place via SMS text messages. Vocabularies used in SMS texts can be expected to use vocabularies of different classes from those used in general Korean literary style sentence. For example, in the case of a typical literary style, the sentence is correctly initiated or terminated and the sentence is well constructed, while SMS text corpus often replaces the component with an omission and a brief representation. To analyze these vocabulary usage characteristics, the existing colloquial style corpus and the literary style corpus are used. The experiment compares and analyzes the vocabulary use characteristics of the colloquial corpus SMS text corpus and the Naver Sentiment Movie Corpus, and the written Korean written corpus. For the comparison and analysis of vocabulary for each corpus, the part of speech tag adjective (VA) was used as a standard, and a distinctive collexeme analysis method was used to measure collostructural strength. As a result, it was confirmed that adjectives related to emotional expression such as'good-','sorry-', and'joy-' were preferred in the SMS text corpus, while adjectives related to evaluation expressions were preferred in the Naver Sentiment Movie Corpus. The word embedding was used to automatically construct a sentiment lexicon based on the extracted adjectives with high collostructural strength, and a total of 343,603 sentiment representations were automatically built.

Automatic Error Correction System for Erroneous SMS Strings (SMS 변형된 문자열의 자동 오류 교정 시스템)

  • Kang, Seung-Shik;Chang, Du-Seong
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2007.06a
    • /
    • pp.59-60
    • /
    • 2007
  • 휴대폰과 메신저 등 통신 환경에서 사용되는 표준어가 아닌 SMS의 변형된 어휘 및 띄어쓰기 오류를 자동으로 교정하여 형태소 분석 및 품사 태깅의 성능 저하 문제를 방지하는 문자열 오류의 교정 방법을 제안하였다. 통신 어휘들의 문자열 사전 구축 방법으로 통신어휘집을 기반으로 수동으로 구축하는 방법과 수작업으로 구축된 말뭉치로부터 자동으로 변형된 문자열을 추출하는 방법, 그리고 문맥을 고려하는 방법을 비교-분석하고 실험 및 성능 평가 결과를 제시하였다.

  • PDF

Automatic Error Correction System for Erroneous SMS Strings (SMS 변형된 문자열의 자동 오류 교정 시스템)

  • Kang, Seung-Shik;Chang, Du-Seong
    • Journal of KIISE:Software and Applications
    • /
    • v.35 no.6
    • /
    • pp.386-391
    • /
    • 2008
  • Some spoken word errors that violate grammatical or writing rules occurs frequently in communication environments like mobile phone and messenger. These unexpected errors cause a problem in a language processing system for many applications like speech recognition, text-to-speech translation, and so on. In this paper, we proposed and implemented an automatic correction system of ill-formed words and word spacing errors in SMS sentences that has been the major errors of poor accuracy. We experimented three methods of constructing the word correction dictionary and evaluated the results of those methods. They are (1) manual construction of error words from the vocabulary list of ill-formed communication languages, (2) automatic construction of error dictionary from the manually constructed corpus, and (3) context-dependent method of automatic construction of error dictionary.

Vocabulary Coverage Improvement for Embedded Continuous Speech Recognition Using Part-of-Speech Tagged Corpus (품사 부착 말뭉치를 이용한 임베디드용 연속음성인식의 어휘 적용률 개선)

  • Lim, Min-Kyu;Kim, Kwang-Ho;Kim, Ji-Hwan
    • MALSORI
    • /
    • no.67
    • /
    • pp.181-193
    • /
    • 2008
  • In this paper, we propose a vocabulary coverage improvement method for embedded continuous speech recognition (CSR) using a part-of-speech (POS) tagged corpus. We investigate 152 POS tags defined in Lancaster-Oslo-Bergen (LOB) corpus and word-POS tag pairs. We derive a new vocabulary through word addition. Words paired with some POS tags have to be included in vocabularies with any size, but the vocabulary inclusion of words paired with other POS tags varies based on the target size of vocabulary. The 152 POS tags are categorized according to whether the word addition is dependent of the size of the vocabulary. Using expert knowledge, we classify POS tags first, and then apply different ways of word addition based on the POS tags paired with the words. The performance of the proposed method is measured in terms of coverage and is compared with those of vocabularies with the same size (5,000 words) derived from frequency lists. The coverage of the proposed method is measured as 95.18% for the test short message service (SMS) text corpus, while those of the conventional vocabularies cover only 93.19% and 91.82% of words appeared in the same SMS text corpus.

  • PDF