Selection of Korean General Vocabulary for Machine Readable Dictionaries

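Selection of Basic Korean Vocabulary for Electronic Dictionaries for Natural Language Processing (original Korean title: 자연언어처리용 전자사전을 위한 한국어 기본어휘 선정)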

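  • 배희숙 (KAIST, Research Center for Terminology and Language Engineering) ;
  • 이주호 (KAIST, Research Center for Terminology and Language Engineering) ;
  • 시정곤 (KAIST, Research Center for Terminology and Language Engineering) ;
  • 최기선 (KAIST, Research Center for Terminology and Language Engineering)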
  • Published : 2003.06.01

Abstract

According to Jeong Ho-seong (1999), Koreans use on average only 20% of the 508,771 entries in the Korean standard unabridged dictionary. To build a machine-readable dictionary (MRD) for natural language processing, it is therefore necessary to select the Korean lexical units that are used frequently and can be regarded as basic words. In this study, the selection is carried out semi-automatically using the KAIST large corpus. From about 220,000 morphemes extracted from the corpus of 40,000,000 eojeols, 50,637 morphemes (54,797 senses) are selected. The coverage of these morphemes is then examined on two sub-corpora of different styles. The total coverage is 91.21% for the formal style and 93.24% for the informal style. The 6,130 first-degree morphemes alone cover 73.64% and 81.45%, respectively.
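The coverage figures above can be illustrated with a minimal sketch. It assumes that coverage means the share of running morpheme tokens in a sub-corpus that belong to the selected vocabulary; the abstract does not state the exact definition, so this is an assumption. The function name `coverage` and the toy data are hypothetical and do not come from the study.

```python
from collections import Counter


def coverage(tokens, vocabulary):
    """Return the percentage of tokens whose morpheme is in the vocabulary.

    Assumption: coverage is measured over running tokens, not over
    distinct morpheme types.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for morpheme, n in counts.items() if morpheme in vocabulary)
    return 100.0 * covered / total


# Toy, hand-made data standing in for a morphologically analysed sub-corpus.
tokens = ["나", "는", "학교", "에", "가", "다", "학교", "는", "크", "다"]

# Hypothetical selected vocabulary and a smaller "first-degree" subset.
vocabulary = {"나", "는", "학교", "에", "가", "다"}
first_degree = {"는", "다", "학교"}

print(f"all selected morphemes: {coverage(tokens, vocabulary):.2f}%")
print(f"first-degree morphemes: {coverage(tokens, first_degree):.2f}%")
```

Running the same function once per sub-corpus and once per vocabulary tier would give a table of the same shape as the four percentages reported in the abstract.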

Keywords