Automatic Construction of Reduced Dimensional Cluster-based Keyword Association Networks using LSI


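  • 유한묵 (Department of Electrical and Computer Engineering, University of Seoul)
  • 김한준 (Department of Electrical and Computer Engineering, University of Seoul)
  • 장재영 (Division of Computer Engineering, Hansung University)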
  • Received : 2017.05.30
  • Accepted : 2017.09.14
  • Published : 2017.11.15

Abstract

In this paper, we propose LSI-based ClusterTextRank, a method that extracts keywords on a per-cluster basis by combining the existing TextRank algorithm with a mutual information measure, together with a method that builds an association network over the extracted keywords using latent semantic indexing (LSI). The proposed method represents the document collection as a term-document matrix and uses LSI to reduce it to a low-dimensional concept space. It then partitions the documents into multiple clusters with the k-means clustering algorithm, represents the words in each cluster as a maximal spanning tree graph, and extracts the significant keywords by evaluating their mutual information within the clusters. Next, it computes the similarities between the extracted keywords using the term-concept matrix obtained through LSI, and represents the result as a keyword association network. To evaluate the proposed method, we used travel-related blog data and showed that it improves keyword extraction accuracy by about 14% over the existing TextRank algorithm.
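
The two graph-building steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The term-concept vectors, the co-occurrence weights, the 0.5 similarity threshold, and the function names are all hypothetical. The sketch also assumes cosine similarity as the keyword similarity measure and co-occurrence counts as edge weights, neither of which the abstract specifies. In practice the term-concept vectors would come from the truncated SVD performed by LSI.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def association_network(term_vectors, threshold=0.5):
    """Return {(kw1, kw2): similarity} for keyword pairs at or above threshold.

    term_vectors maps each keyword to its row of a term-concept matrix.
    The threshold is an assumed cut-off, not a value taken from the paper.
    """
    edges = {}
    for a, b in combinations(sorted(term_vectors), 2):
        sim = cosine(term_vectors[a], term_vectors[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges


def maximal_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm, taking (weight, a, b) edges in descending weight order."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, a, b in sorted(weighted_edges, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Hypothetical 2-concept vectors for three travel-related keywords.
term_vectors = {
    "beach": [0.9, 0.1],
    "island": [0.8, 0.2],
    "museum": [0.1, 0.9],
}
network = association_network(term_vectors, threshold=0.5)

# Hypothetical co-occurrence weights between the words of one cluster.
tree = maximal_spanning_tree(
    ["beach", "island", "museum"],
    [(3, "beach", "island"), (2, "island", "museum"), (1, "beach", "museum")],
)
```

With these toy inputs, only the "beach" and "island" pair clears the threshold, and the spanning tree keeps the two heaviest edges. A real pipeline would replace the hand-written vectors with rows of the LSI term-concept matrix and would choose the threshold empirically.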

Acknowledgement

Supported by: National Research Foundation of Korea, Hansung University
