A Study on Keyword Extraction From a Single Document Using Term Clustering

  • Seung-Hee Han (Department of Library and Information Science, College of Social Sciences, Seoul Women's University)
  • Received : 2010.07.19
  • Accepted : 2010.08.11
  • Published : 2010.08.30

Abstract

This study proposes a keyword extraction algorithm for single documents based on term clustering. Each document is divided into passages, and two ways of computing the similarity between terms are compared: first-order similarity and second-order distributional similarity. In the first experiment, the best clustering performance was achieved with 50-term passages and second-order distributional similarity. Based on this result, the second experiment applied the term clusters obtained with second-order distributional similarity to several keyword extraction formulas that combine simple and relative term frequencies. Of these, pf (paragraph frequency) and $tf{\times}ipf$ (term frequency multiplied by inverse paragraph frequency) gave the best results. These results indicate that the proposed algorithm extracts keywords effectively from a single document with respect to the two conditions that good keywords should satisfy: topicality and an even frequency distribution.
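
The two term statistics named in the abstract, pf and $tf{\times}ipf$, can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the definitions here are assumptions made by analogy with tf-idf: fixed-size passages stand in for paragraphs, pf is the number of passages containing a term, and ipf is taken to be log(N / pf) for N passages. The function names `split_passages`, `term_statistics`, and `tf_ipf` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def split_passages(tokens, size=50):
    """Split a token list into consecutive passages of `size` terms.

    The default of 50 mirrors the 50-term passage reported in the abstract;
    the last passage may be shorter.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def term_statistics(passages):
    """Return (tf, pf): total term frequency and passage frequency."""
    tf = Counter()
    pf = Counter()
    for passage in passages:
        tf.update(passage)       # every occurrence counts toward tf
        pf.update(set(passage))  # each passage counts at most once toward pf
    return tf, pf


def tf_ipf(passages):
    """Score each term as tf * log(N / pf).

    Assumption: ipf is defined like idf, with passages in place of documents.
    """
    tf, pf = term_statistics(passages)
    n = len(passages)
    return {term: tf[term] * math.log(n / pf[term]) for term in tf}
```

Under this assumed formula a term that occurs in every passage scores 0, so pf and $tf{\times}ipf$ reward different behaviour: pf favours terms spread evenly over the document, while $tf{\times}ipf$ favours frequent terms concentrated in few passages. The weighting actually used in the paper may differ.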
