Efficient Topic Modeling by Mapping Global and Local Topics

An Efficient Topic Modeling Approach through Local Mapping of Global Topics

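  • Hochang Choi (Graduate School of Business IT, Kookmin University);
  • Namgyu Kim (School of Management Information Systems, College of Business Administration, Kookmin University)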
  • Received : 2017.07.21
  • Accepted : 2017.09.19
  • Published : 2017.09.30

Abstract

Growing demand for big data analysis has recently driven vigorous development of related technologies and tools. At the same time, advances in IT and the rising penetration of smart devices are producing large amounts of data. As a result, data analysis technology is rapidly becoming popular, and attempts to gain insights through data analysis keep increasing. This suggests that big data analysis will become more important across industries for the foreseeable future. Big data analysis has generally been performed by a small number of experts and delivered to those who request it. However, growing interest in big data analysis has stimulated computer programming education and the development of many data analysis programs. Accordingly, the entry barriers to big data analysis are gradually lowering, data analysis technology is spreading, and analysis is expected to be performed increasingly by the requesters themselves. Along with this, interest in various kinds of unstructured data is continually increasing, with particular attention on text data. The emergence of new web-based platforms and techniques has brought about the mass production of text data and active attempts to analyze it, and the results of text analysis are used in various fields. Text mining is a concept that embraces various theories and techniques for text analysis. Among the many text mining techniques used for various research purposes, topic modeling is one of the most widely used and studied. Topic modeling extracts the major issues from a large number of documents, identifies the documents that correspond to each issue, and provides the identified documents as a cluster. It is regarded as a very useful technique because it reflects the semantic elements of documents.
Traditional topic modeling is based on the distribution of key terms across the entire document collection. Thus, the entire collection must be analyzed at once to identify the topic of each document. This requirement makes the analysis slow when topic modeling is applied to a large number of documents. It also causes a scalability problem: processing time increases exponentially with the number of documents to be analyzed. This problem is particularly noticeable when the documents are distributed across multiple systems or regions. To overcome these problems, a divide-and-conquer approach can be applied to topic modeling: a large number of documents is divided into sub-units, and topics are derived by repeating topic modeling on each unit. This method makes topic modeling on a large number of documents feasible with limited system resources and can improve processing speed. It can also significantly reduce analysis time and cost, because documents can be analyzed at each location without first being combined. Despite these advantages, the method has two major problems. First, the relationship between the local topics derived from each unit and the global topics derived from the entire collection is unclear: local topics can be identified for each document, but global topics cannot. Second, a method for measuring the accuracy of such an approach needs to be established. That is, taking the global topics as the ideal answer, the deviation of the local topics from the global topics needs to be measured. Because of these difficulties, this approach has not been studied sufficiently compared with other topic modeling research. In this paper, we propose a topic modeling approach that addresses these two problems.
First, we divide the entire document collection (global set) into sub-clusters (local sets) and generate a reduced global set (RGS) consisting of representative documents extracted from each local set. We address the first problem by mapping RGS topics to local topics. We then verify the accuracy of the proposed methodology by checking whether each document is assigned to the same topic in the global and local results. We conduct experiments on 24,000 news articles to evaluate the practical applicability of the proposed methodology. Through an additional experiment, we confirm that the proposed methodology can produce results similar to those of topic modeling on the entire collection. We also propose a reasonable method for comparing the results of the two approaches.
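
The mapping between local topics and RGS topics described above can be illustrated with a minimal sketch. The abstract does not specify the mapping rule, so the majority vote used here is an assumption, as are the function names `map_local_to_rgs` and `propagate` and the toy topic labels. The sketch presumes that topic modeling has already been run on one local set and on the RGS, and that each document has been assigned a single dominant topic.

```python
from collections import Counter, defaultdict


def map_local_to_rgs(local_assign, rgs_assign):
    """Map each local topic to an RGS topic by majority vote (assumed rule).

    local_assign: {doc_id: local_topic} for every document in one local set
    rgs_assign:   {doc_id: rgs_topic} for the representative documents only
    """
    votes = defaultdict(Counter)
    for doc_id, local_topic in local_assign.items():
        # Only representative documents appear in both models,
        # so only they can link a local topic to an RGS topic.
        if doc_id in rgs_assign:
            votes[local_topic][rgs_assign[doc_id]] += 1
    # Local topics without any representative document stay unmapped.
    return {lt: counter.most_common(1)[0][0] for lt, counter in votes.items()}


def propagate(local_assign, mapping):
    """Give every document in the local set the RGS topic of its local topic."""
    return {doc_id: mapping.get(lt) for doc_id, lt in local_assign.items()}


# Hypothetical toy data: five documents in one local set,
# three of which were chosen as representatives for the RGS.
local_assign = {"d1": "L0", "d2": "L0", "d3": "L1", "d4": "L1", "d5": "L0"}
rgs_assign = {"d1": "G2", "d3": "G0", "d5": "G2"}

mapping = map_local_to_rgs(local_assign, rgs_assign)
inferred = propagate(local_assign, mapping)
print(mapping)   # {'L0': 'G2', 'L1': 'G0'}
print(inferred)
```

In this toy case the non-representative documents `d2` and `d4` inherit an RGS topic without ever being part of a model fitted on the whole collection, which is the point of using representative documents as intermediaries.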

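Along with the continued growth in demand for big data analysis, related techniques and tools have advanced rapidly, and big data analysis is accordingly shifting from a monopoly of a few experts to something individual users carry out themselves. Interest is also growing in ways to use unstructured data that was difficult to analyze with traditional methods, and research is particularly active on topic modeling, which derives topics from vast amounts of text. Because traditional topic modeling is based on the distribution of key terms across the entire document collection, identifying the topic of each document requires a batch analysis of the whole collection. Topic modeling of large document collections therefore takes a long time, and the problem is more severe when the documents to be analyzed are stored across multiple systems or regions. One way to overcome this is to divide the large collection into sub-clusters and derive topics by analyzing each cluster. In that case, however, the local topics derived from each cluster differ from the global topics derived from the entire collection, so the correspondence between each document and the global topics cannot be identified. This study therefore proposes dividing the entire collection into sub-clusters, extracting representative documents from each sub-cluster to form a reduced global document set, and using the representative documents as intermediaries to derive the components of the global topics from the local topics of the sub-clusters. We also evaluated the practical applicability of the proposed methodology through experiments on 24,000 news articles, and compared the topic analysis results of the divide-and-conquer approach under the proposed methodology with those of batch processing over the entire collection.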
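
One way to compare the divide-and-conquer result with the batch result is to check, for every pair of documents, whether the two methods agree on the pair sharing a topic. The sketch below computes that pairwise agreement (the Rand index). The abstract does not state the exact measure used in the paper, so this choice, the function name `pairwise_agreement`, and the toy labels are assumptions for illustration only. A pairwise measure is convenient here because topic labels from the two methods do not need to be aligned beforehand.

```python
from itertools import combinations


def pairwise_agreement(global_assign, inferred_assign):
    """Fraction of document pairs on which both methods agree about
    whether the two documents belong to the same topic (Rand index)."""
    docs = sorted(set(global_assign) & set(inferred_assign))
    pairs = list(combinations(docs, 2))
    if not pairs:
        raise ValueError("need at least two documents present in both results")
    agree = sum(
        (global_assign[a] == global_assign[b])
        == (inferred_assign[a] == inferred_assign[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical toy data: topics from batch modeling on the whole collection
# versus topics inferred through the divide-and-conquer route.
global_assign = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
inferred_assign = {"d1": "A", "d2": "A", "d3": "B", "d4": "A"}

score = pairwise_agreement(global_assign, inferred_assign)
print(score)  # 0.5
```

Here three of the six document pairs are treated consistently by both methods, so the agreement is 0.5; identical groupings would give 1.0 regardless of how the topics are named.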
