An Extraction Algorithm of Compound Field-associated Terms for Korean Document Classifications

  • Sangkon Lee (이상곤), School of Information Technology Engineering, Jeonju University
  • Published: 2005.07.01

Abstract

Field-associated terms carry field information in the words themselves, so they can determine the field of a document in much the same way a human reader recognizes it. For Korean, we collected a document bank of approximately 15,000 documents classified into 180 fields, built a set of field-associated terms from it, and ran experiments. The 88,782 single field-associated terms were compressed to 8,405, approximately 9% of the total, with high extraction accuracy: recall of at least 0.77 (average 0.85) and precision of at least 0.90 (average 0.94). When the resulting field-associated terms were applied to the initial field decision in document classification and compared with field decisions made by humans, approximately 90% or more of the answers were correct. The results can serve as basic research on the initial stage of document classification and, applied to document retrieval across languages, as basic research for multilingual information retrieval.
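
The arithmetic behind the reported figures can be checked with a minimal sketch. The term counts (88,782 and 8,405) come from the abstract; the function names and the toy term sets are hypothetical and only illustrate the standard definitions of precision and recall, not the paper's evaluation procedure.

```python
def compression_rate(original_terms: int, retained_terms: int) -> float:
    """Fraction of the original term set that remains after compression."""
    return retained_terms / original_terms


def precision_recall(extracted: set, relevant: set) -> tuple:
    """Standard set-based precision and recall.

    precision = correct extracted terms / all extracted terms
    recall    = correct extracted terms / all relevant terms
    """
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


# Counts taken from the abstract.
rate = compression_rate(88_782, 8_405)
print(f"compression rate: {rate:.1%}")  # 9.5%, reported as "approximately 9%"

# Hypothetical toy data for a single field (not from the paper).
extracted = {"home run", "pitcher", "stock"}
relevant = {"home run", "pitcher", "inning", "shortstop"}
p, r = precision_recall(extracted, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

In this toy case two of the three extracted terms are correct (precision 2/3) and two of the four relevant terms are found (recall 1/2). The abstract's averages of 0.94 and 0.85 would be the same quantities, presumably averaged over the fields.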
