A stemming algorithm for a Korean language free-text retrieval system

Design and implementation of a stemming algorithm for a natural language retrieval system

  • Published: 1997.12.01

Abstract

A stemming algorithm for a Korean language free-text retrieval system has been designed and implemented. The algorithm has three major parts and operates iteratively. First, stop-words are removed using a stop-word list. Second, a basic removal procedure applies rule table 1, which contains suffixes, postpositional particles, and optional symbols that specify each stemming action. Third, extended stemming and rewriting procedures apply rule table 2, which is composed of suffixes and optionally combined symbols representing various actions that depend on context-sensitive rules. A test was carried out to indicate how successful the algorithm was and to identify minor changes that would improve it. The test yielded a compression of 21.4% and an error rate of 15.9%.
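The three-part iterative flow described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the stop-word list, both rule tables, the `(replacement, min_stem_len)` rule shape, the longest-match policy, and the names `strip_longest` and `stem` are all assumptions introduced here, since the abstract does not give the actual table contents or action symbols.

```python
# Hypothetical data: none of these entries are taken from the paper.
STOP_WORDS = {"그리고", "그러나"}

# Assumed rule shape for both tables: ending -> (replacement, min_stem_len).
# Rule table 1 (assumed): plain removal of particles and basic endings.
RULE_TABLE_1 = {
    "에서": ("", 2),
    "으로": ("", 2),
    "은": ("", 2),
    "는": ("", 2),
    "을": ("", 2),
    "를": ("", 2),
}

# Rule table 2 (assumed): extended endings, some with a rewrite instead of removal.
RULE_TABLE_2 = {
    "하였다": ("", 2),
    "했다": ("", 2),
    "로운": ("롭다", 2),  # rewrite rule: replace the ending rather than drop it
}


def strip_longest(word, table):
    """Apply the longest ending in `table` whose condition holds, else return `word`."""
    for ending in sorted(table, key=len, reverse=True):
        if word.endswith(ending):
            replacement, min_stem_len = table[ending]
            remainder = word[: len(word) - len(ending)]
            # The minimum-length check stands in for a context-sensitive condition.
            if len(remainder) >= min_stem_len:
                return remainder + replacement
    return word


def stem(word, max_passes=5):
    """Return the stem of `word`, or None if the word is a stop-word."""
    for _ in range(max_passes):
        if word in STOP_WORDS:          # part 1: stop-word removal
            return None
        before = word
        word = strip_longest(word, RULE_TABLE_1)   # part 2: basic removal
        word = strip_longest(word, RULE_TABLE_2)   # part 3: extended stemming / rewriting
        if word == before:              # nothing changed, so iteration has converged
            break
    return word


if __name__ == "__main__":
    for w in ["그리고", "도서관에서는", "검색하였다", "자유로운"]:
        print(w, "->", stem(w))
```

In this sketch, stacked particles such as those in "도서관에서는" are removed over successive passes, which is one plausible reading of "operates iteratively"; the real algorithm may order or condition its rules differently.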

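In this study, a stemming algorithm for a natural language retrieval system was designed and implemented. The algorithm proceeds iteratively through the following three steps: removal of stop-words using a stop-word dictionary; processing of basic endings by applying rule table 1; and extended stemming and rewriting routines that apply rule table 2 to the word segments left over from the previous step. Testing on a Korean document collection assembled for evaluating the algorithm's performance gave a compression rate of 21.4% and an error rate of 15.9%.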
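The abstract reports a compression rate and an error rate without defining them. The sketch below shows one common way such figures are computed in stemming evaluations, under two assumptions that are not confirmed by the source: compression is taken as (N - S) / N, where N is the number of distinct words before stemming and S is the number of distinct stems afterwards, and the error rate is taken as the share of words whose stem differs from a manually checked reference stem. The function names and sample data are hypothetical.

```python
def compression_rate(words, stems):
    """Assumed definition: (N - S) / N over distinct words (N) and distinct stems (S)."""
    n = len(set(words))
    s = len(set(stems))
    return (n - s) / n


def error_rate(stems, reference_stems):
    """Assumed definition: fraction of words whose stem differs from a reference stem."""
    wrong = sum(1 for got, want in zip(stems, reference_stems) if got != want)
    return wrong / len(reference_stems)


if __name__ == "__main__":
    # Hypothetical sample, not data from the paper.
    words = ["검색", "검색을", "검색하였다", "도서관", "도서관에서"]
    stems = ["검색", "검색", "검색", "도서관", "도서관"]
    reference = ["검색", "검색", "검색", "도서관", "도서"]
    print(f"compression: {compression_rate(words, stems):.1%}")
    print(f"error rate:  {error_rate(stems, reference):.1%}")
```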
