Unsupervised Noun Sense Disambiguation using Local Context and Co-occurrence

국소 문맥과 공기 정보를 이용한 비교사 학습 방식의 명사 의미 중의성 해소

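  • Seungwoo Lee (이승우; Information and Communications Research Institute, Pohang University of Science and Technology)
  • Geunbae Lee (이근배; Department of Computer Engineering, Pohang University of Science and Technology)
  • Published: 2000.07.15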

Abstract

This paper addresses word sense disambiguation for Korean nouns. We define a local context as a knowledge source that can be obtained from a raw (untagged) corpus and describe how to extract it. Following the intuition that different nouns occurring in the same local context tend to have similar meanings, we use as clues the words that share a local context with the target noun. This raises the utilization of the extracted knowledge and makes it possible to disambiguate infrequent words; expanding the predicates (verbs) in a local context further reduces the data sparseness problem. The sense of a target noun is the one with maximum similarity to the previously learned clues, where the similarity between two words is computed from the distance between their concepts in a sense hierarchy borrowed from WordNet. Because the ambiguity of the clues is reduced gradually while the maximum similarity is computed, subsequent similarity calculations become faster. When a target noun has two or more local contexts, each is weighted by its type to reflect how strongly that type of local context restricts meaning. As a second knowledge source, we extract co-occurrence information from the dictionary definitions and example sentences of the target noun and use it to supplement the local contexts when selecting the most appropriate sense. Experiments show that the applicability of local contexts is high and that co-occurrence information complements the local contexts and improves accuracy. Although the target nouns used in our experiments are highly ambiguous, the proposed method achieves an accuracy of 89.8%, higher than that of earlier supervised methods that rely on a sense-tagged corpus.
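The sense-selection step described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy hierarchy `PARENT`, the similarity formula `1 / (1 + distance)`, the context weights, and the example noun 배 with the senses pear, ship, and belly. The sketch also treats every clue as already unambiguous, so the gradual reduction of clue ambiguity and the co-occurrence information are left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy is-a hierarchy (child -> parent). The paper borrows its
# hierarchy from WordNet; this small tree is only a stand-in.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "object": "entity",
    "body_part": "entity",
    "food": "object",
    "vehicle": "object",
    "fruit": "food",
    "bread": "food",
    "pear": "fruit",
    "apple": "fruit",
    "ship": "vehicle",
    "belly": "body_part",
}


def ancestors(concept: str) -> List[str]:
    """Return the chain from a concept up to the root, inclusive."""
    chain = []
    node: Optional[str] = concept
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def distance(a: str, b: str) -> int:
    """Number of edges on the path between two concepts in the hierarchy."""
    steps_from_b = {c: i for i, c in enumerate(ancestors(b))}
    for steps_from_a, c in enumerate(ancestors(a)):
        if c in steps_from_b:  # first shared ancestor on the way up from a
            return steps_from_a + steps_from_b[c]
    raise ValueError("concepts share no common ancestor")


def similarity(a: str, b: str) -> float:
    """Assumed formula: closer concepts are more similar."""
    return 1.0 / (1.0 + distance(a, b))


def disambiguate(
    senses: List[str], contexts: List[Tuple[float, List[str]]]
) -> Tuple[str, Dict[str, float]]:
    """Pick the sense with the highest weighted maximum similarity.

    Each context is (weight, clue concepts); the weight stands for the
    strength of the semantic restriction of that local-context type.
    """
    scores: Dict[str, float] = {}
    for sense in senses:
        scores[sense] = sum(
            weight * max((similarity(sense, clue) for clue in clues), default=0.0)
            for weight, clues in contexts
        )
    best = max(scores, key=lambda s: scores[s])
    return best, scores


# Hypothetical example: 배 as the object of "eat" (clues: apple, bread) and
# modified by "delicious" (clue: bread), with assumed weights 1.0 and 0.5.
best_sense, sense_scores = disambiguate(
    ["pear", "ship", "belly"],
    [(1.0, ["apple", "bread"]), (0.5, ["bread"])],
)
print(best_sense)
```

With these assumed clues the food-related sense wins because it lies closest to the clue concepts in the hierarchy; with different weights or a different hierarchy the ranking could change.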
