News Topic Extraction based on Word Similarity

  • Kim, Dongwook (Department of Computer Science, Graduate School, Soongsil University) ;
  • Lee, Soowon (School of Software, Soongsil University)
  • Received : 2017.01.26
  • Accepted : 2017.08.08
  • Published : 2017.11.15

Abstract

Topic extraction is a technology that automatically extracts a set of topics from a set of documents, and it has been a major research topic in natural language processing. Representative topic extraction methods include Latent Dirichlet Allocation (LDA) and word clustering-based methods. However, these methods suffer from two problems: repeated topics and mixed topics. In the repeated-topic problem, a single topic is extracted as several topics; in the mixed-topic problem, several topics are mixed within a single extracted topic. To solve these problems, this study proposes a method that first extracts topics using LDA, which is robust against the repeated-topic problem, and then corrects the extracted topics through separation and merging steps based on the similarity between words. In the experiments, the proposed method showed better performance than the conventional LDA method.
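The separation and merging steps can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, thresholds, or the exact procedures, so everything here is an assumption made for illustration: topics are taken as already extracted (for example, the top words of each LDA topic), word similarity is cosine similarity over toy context-count vectors, and the function names `split_topic` and `merge_topics`, the threshold of 0.5, and the sample data are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def split_topic(words, vectors, threshold):
    """Separate one topic into groups of words linked by similarity >= threshold."""
    groups = []
    for w in words:
        # Existing groups that contain at least one word similar to w.
        linked = [g for g in groups
                  if any(cosine(vectors[w], vectors[x]) >= threshold for x in g)]
        rest = [g for g in groups if g not in linked]
        # Fuse all linked groups together with w into a single group.
        groups = rest + [[x for g in linked for x in g] + [w]]
    return groups

def topic_similarity(a, b, vectors):
    """Average pairwise word similarity between two topics."""
    return sum(cosine(vectors[x], vectors[y]) for x in a for y in b) / (len(a) * len(b))

def merge_topics(topics, vectors, threshold):
    """Greedily merge each topic into the first sufficiently similar merged topic."""
    merged = []
    for t in topics:
        for m in merged:
            if topic_similarity(m, t, vectors) >= threshold:
                m.extend(w for w in t if w not in m)
                break
        else:
            merged.append(list(t))
    return merged

# Toy context-count vectors (assumed data, not from the paper).
vectors = {
    "election":   {"vote": 3, "party": 2},
    "ballot":     {"vote": 4, "party": 1},
    "candidate":  {"vote": 2, "party": 3},
    "earthquake": {"damage": 3, "rescue": 2},
    "aftershock": {"damage": 2, "rescue": 1},
}

# Assumed input: one politics topic, plus a mixed topic that also
# contains a politics word (so politics is repeated across topics).
topics = [
    ["election", "ballot"],
    ["candidate", "earthquake", "aftershock"],
]

# Step 1: separate mixed topics. Step 2: merge repeated topics.
parts = [g for t in topics for g in split_topic(t, vectors, 0.5)]
corrected = merge_topics(parts, vectors, 0.5)
print(corrected)
# [['election', 'ballot', 'candidate'], ['earthquake', 'aftershock']]
```

In this toy run, the mixed topic is separated into a politics fragment and an earthquake group, and the politics fragment is then merged back into the first topic. The result depends heavily on the chosen threshold and similarity measure, which would need to be tuned on real data.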

Acknowledgement

Supported by : National Research Foundation of Korea
