An Efficient Index Term Extraction Method in IR using Lexical Chains

  • Kang, Bo-Yeong (강보영) (Dept. of Computer Engineering, Kyungpook National University) ;
  • Lee, Sang-Jo (이상조) (Dept. of Computer Engineering, Kyungpook National University)
  • Published : 2002.08.01

Abstract

In information retrieval and digital libraries, one of the most important tasks is to find exactly the information that users need. To do so, a system must understand not only the user's intent but also the content of the documents. In this paper, we present an efficient index term extraction method that helps capture the semantic content of a document and so retrieve information more precisely. The method extracts index terms using lexical chains. Before generating lexical chains, we roughly disambiguate the senses of the nouns in a document using a concept we call the semantic window: we look ahead at the semantic relations among peripheral (neighboring) nouns and use them to determine the sense of each noun. After generating lexical chains from the sense-disambiguated nouns, we identify strong chains using a set of metrics and extract index terms from a few of the strongest chains. We evaluated our system by comparing its results with those of KEA, a key phrase extraction system. We expect the system to be useful in general document domains, including information retrieval and digital libraries.
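The pipeline described in the abstract (window-based sense disambiguation, lexical chaining, strong-chain selection, index term extraction) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the toy `SENSES` inventory stands in for a real thesaurus, two senses count as related when they share a concept label, a chain is scored by its length, and a chain is called strong when its length exceeds the mean chain length. The abstract does not specify the paper's actual relations or metrics.

```python
from collections import Counter

# Hypothetical toy sense inventory: noun -> candidate senses ("noun/concept").
# A real system would look senses up in a thesaurus such as WordNet.
SENSES = {
    "bank": ["bank/finance", "bank/waterway"],
    "interest": ["interest/curiosity", "interest/finance"],
    "loan": ["loan/finance"],
    "money": ["money/finance"],
    "river": ["river/waterway"],
    "weather": ["weather/climate"],
}


def candidates(noun):
    # Nouns missing from the toy inventory get a single placeholder sense.
    return SENSES.get(noun, [noun + "/unknown"])


def related(a, b):
    # Assumed relation: two senses are related if they share a concept label.
    return a.split("/")[1] == b.split("/")[1]


def disambiguate(nouns, window=2):
    """Pick, for each noun, the sense best supported by the nouns around it."""
    chosen = []
    for i, noun in enumerate(nouns):
        # The "semantic window": nouns just before and just after position i.
        neighbours = nouns[max(0, i - window):i] + nouns[i + 1:i + 1 + window]

        def support(sense):
            # Number of neighbouring nouns with at least one related sense.
            return sum(
                any(related(sense, s) for s in candidates(n)) for n in neighbours
            )

        # max() keeps the first-listed sense when support is tied.
        chosen.append(max(candidates(noun), key=support))
    return chosen


def build_chains(senses):
    """Greedily append each sense to the first chain holding a related sense."""
    chains = []
    for sense in senses:
        for chain in chains:
            if any(related(sense, member) for member in chain):
                chain.append(sense)
                break
        else:
            chains.append([sense])
    return chains


def strong_chains(chains):
    # Assumed metric: a chain is strong if it is longer than the mean length.
    mean = sum(len(c) for c in chains) / len(chains)
    return [c for c in chains if len(c) > mean]


def index_terms(nouns, window=2):
    """Return the nouns of the strong chains, most frequent first."""
    senses = disambiguate(nouns, window)
    strong = strong_chains(build_chains(senses))
    counts = Counter(s.split("/")[0] for chain in strong for s in chain)
    return [term for term, _ in counts.most_common()]


if __name__ == "__main__":
    doc_nouns = ["bank", "loan", "interest", "money", "weather", "bank", "money"]
    print(index_terms(doc_nouns))
```

In this toy run the financial nouns form one long chain while "weather" forms a chain of length one, so only the financial nouns are returned as index terms. Swapping the length-based score for another strength metric only requires changing `strong_chains`.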
