Automatic Error Correction System for Erroneous SMS Strings

Kang, Seung-Shik; Chang, Du-Seong

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 35, Issue 6 / Pages 386-391 / 2008 / pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

SMS 변형된 문자열의 자동 오류 교정 시스템

Kang, Seung-Shik (School of Computer Engineering, Kookmin University);
Chang, Du-Seong (KT Future Technology Laboratory)

Published : 2008.06.15

Abstract

Colloquial word errors that violate grammatical or orthographic rules occur frequently in communication environments such as mobile phones and messengers. These unexpected errors cause problems for language processing systems in applications such as speech recognition and text-to-speech synthesis. In this paper, we propose and implement a system that automatically corrects ill-formed words and word-spacing errors in SMS sentences, which are the major causes of poor accuracy. We experimented with three methods of constructing the word correction dictionary and evaluated their results: (1) manual construction of error words from a vocabulary list of ill-formed communication language, (2) automatic construction of an error dictionary from a manually constructed corpus, and (3) a context-dependent method of automatically constructing the error dictionary.

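When people send text messages in communication environments such as mobile phones and messengers, they use distorted, non-standard vocabulary, and these altered words cause many problems for application systems in language processing and related fields, such as speech recognition, speech synthesis, and document information extraction. In this paper, we propose a string error correction method, and implement a system, that automatically corrects the altered forms and word-spacing errors in SMS sentences and thereby prevents the resulting degradation of morphological analysis and part-of-speech tagging. The altered-string dictionary has the greatest effect on system performance. For constructing it, we implemented three methods: (1) manual construction based on a vocabulary list of communication language, (2) automatic extraction of altered strings from a manually built corpus, and (3) automatic extraction of altered strings that takes the left and right context into account. We present a comparative analysis and performance evaluation of these methods through experiments.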
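
The abstract does not give the algorithms themselves, so the following is only a minimal sketch of how dictionary construction methods (2) and (3) might look, under several assumptions: the manually built corpus is available as pairs of raw and corrected token lists aligned one-to-one, the context is exactly one token to the left and one to the right, and the most frequent correction wins. The names `build_dictionaries`, `correct`, and `BOUNDARY`, as well as the English texting data, are hypothetical illustrations and do not come from the paper.

```python
from collections import Counter, defaultdict

BOUNDARY = "<s>"  # hypothetical sentence-boundary marker (an assumption)


def build_dictionaries(pairs):
    """Collect corrections from (raw_tokens, corrected_tokens) pairs.

    Assumes the two token lists in each pair are aligned one-to-one.
    Returns a context-free dictionary (method 2 in the abstract) and a
    dictionary keyed by (left, token, right) (method 3 in the abstract).
    """
    plain = defaultdict(Counter)
    contextual = defaultdict(Counter)
    for raw, fixed in pairs:
        padded = [BOUNDARY] + list(raw) + [BOUNDARY]
        for i, (r, f) in enumerate(zip(raw, fixed)):
            if r != f:  # only altered strings enter the dictionaries
                plain[r][f] += 1
                # padded[i] is the left neighbor, padded[i + 2] the right one
                contextual[(padded[i], r, padded[i + 2])][f] += 1
    return plain, contextual


def correct(tokens, plain, contextual):
    """Replace each token, preferring a context match over a context-free one."""
    padded = [BOUNDARY] + list(tokens) + [BOUNDARY]
    out = []
    for i, tok in enumerate(tokens):
        key = (padded[i], tok, padded[i + 2])
        if key in contextual:
            out.append(contextual[key].most_common(1)[0][0])
        elif tok in plain:
            out.append(plain[tok].most_common(1)[0][0])
        else:
            out.append(tok)  # unknown tokens pass through unchanged
    return out


# Toy corpus invented for illustration; not data from the paper.
corpus = [
    (["c", "u", "2moro"], ["see", "you", "tomorrow"]),
    (["me", "2"], ["me", "too"]),
    (["2", "u"], ["to", "you"]),
]
plain_dict, context_dict = build_dictionaries(corpus)
print(correct(["me", "2"], plain_dict, context_dict))
print(correct(["2", "u"], plain_dict, context_dict))
```

In this toy corpus the token "2" has two possible corrections, so a context-free dictionary alone cannot choose reliably between them. The context-keyed lookup resolves the ambiguity whenever the same neighbors were seen in the corpus, and the sketch falls back to the context-free entry otherwise.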
