http://dx.doi.org/10.13088/jiis.2018.24.1.183

Construction of Event Networks from Large News Data Using Text Mining Techniques  

Lee, Minchul (Graduate School of Library and Information Science, Yonsei University)
Kim, Hea-Jin (Institute of the Study for the Korean Modernity, Yonsei University)
Publication Information
Journal of Intelligence and Information Systems / v.24, no.1, 2018, pp. 183-203
Abstract
News articles are the most suitable medium for examining events occurring at home and abroad. In particular, as advances in information and communication technology have given rise to many kinds of online news media, the volume of news about events in society has increased greatly. Automatically summarizing key events from massive amounts of news data therefore helps users survey many events at a glance. In addition, an event network built on the relevance between events can greatly help readers understand current events. In this study, we propose a method for extracting event networks from large news text data. To this end, we first collected Korean political and social news articles from March 2016 to March 2017, then preprocessed them using NPMI and Word2Vec to retain only meaningful words and merge synonyms. Latent Dirichlet allocation (LDA) topic modeling was used to calculate the topic distribution by date, and events were detected by finding peaks in that distribution. A total of 32 topics were extracted by the topic modeling, and the time of occurrence of each event was inferred from the points at which each topic's distribution surged. As a result, a total of 85 events were detected, and after filtering with the Gaussian smoothing technique, 16 final events were presented. We also calculated a relevance score between the detected events in order to construct the event network. The relevance between events was calculated using the cosine coefficient between co-occurring events, and related events were connected accordingly. Finally, we set up the event network by treating each event as a vertex and the relevance score between events as the weight of the edge connecting the corresponding vertices.
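The peak-based detection step can be illustrated with a minimal sketch. The abstract does not give the kernel width, the threshold, or the exact peak rule, so `sigma`, `threshold`, the helper names, and the toy series below are all assumptions made for illustration, not the authors' settings. The sketch only shows why smoothing a topic's daily share before peak-picking removes short-lived spikes while keeping sustained bursts.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights covering about +/- 3 sigma (assumed width)."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(series, sigma=2.0):
    """Gaussian-smooth a daily series, renormalizing the kernel at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    smoothed = []
    for t in range(len(series)):
        acc, norm = 0.0, 0.0
        for k, w in enumerate(kernel):
            j = t + k - radius
            if 0 <= j < len(series):
                acc += w * series[j]
                norm += w
        smoothed.append(acc / norm)
    return smoothed

def detect_peaks(series, threshold):
    """Days that are local maxima of the series and exceed the threshold."""
    return [t for t in range(1, len(series) - 1)
            if series[t] > series[t - 1]
            and series[t] >= series[t + 1]
            and series[t] > threshold]

# Toy daily share of one topic: a five-day burst plus a one-day blip.
topic_share = [0.02] * 30
for day, value in zip(range(10, 15), [0.1, 0.3, 0.5, 0.3, 0.1]):
    topic_share[day] = value
topic_share[24] = 0.2

raw_peaks = detect_peaks(topic_share, threshold=0.1)
smoothed_peaks = detect_peaks(smooth(topic_share, sigma=2.0), threshold=0.1)
print(raw_peaks, smoothed_peaks)
```

On this toy series the raw data yields two candidate peaks, while the smoothed series keeps only the sustained burst, which mirrors in spirit how candidate events can be reduced to a smaller final set.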
The event network constructed with our method helped us arrange the major political and social events that occurred in Korea over the past year in chronological order and, at the same time, identify which events are related to one another. Our approach differs from existing event detection methods in that LDA topic modeling makes it possible to analyze large amounts of data easily and to identify relevance between events that was difficult to detect with existing methods. In text preprocessing, we applied various text mining techniques together with Word2Vec to improve the accuracy of extracting proper nouns and compound nouns, which has been difficult in the analysis of Korean text. The event detection and network construction techniques of this study have the following advantages in practical application. First, LDA topic modeling, being unsupervised learning, can easily extract topics, topic words, and their distributions from huge amounts of data. Also, by using the date information of the collected news articles, the distribution of each topic can be expressed as a time series. Second, by calculating relevance scores from the co-occurrence of topics and constructing an event network, we can present the connections between events, which are difficult to grasp with existing event detection, in a summarized form. This can be seen from the fact that the inter-event relevance-based event network proposed in this study was actually constructed in order of occurrence time. It is also possible to identify through the event network what happened as the starting point of a series of events. A limitation of this study is that, owing to the characteristics of LDA topic modeling, results differ according to the initial parameters and the number of topics, and the topic and event names in the analysis results must be assigned by the subjective judgment of the researcher.
Also, since each topic is assumed to be exclusive and independent, the model does not take into account the relevance between topics. Subsequent studies need to calculate the relevance between events that are not covered in this study or between events that belong to the same topic.
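The event-network construction described in the abstract (events as vertices, cosine-based relevance scores as edge weights) can be sketched as follows. The abstract does not specify how each event is represented as a vector or what cut-off decides whether two events are linked, so the per-article membership vectors, the `min_relevance` cut-off, and the event names here are hypothetical choices made purely for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine coefficient between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def build_event_network(event_vectors, min_relevance=0.3):
    """Return (vertices, edges): edges maps an event pair to its relevance score.

    min_relevance is an assumed cut-off, not a value taken from the paper.
    """
    vertices = sorted(event_vectors)
    edges = {}
    for a, b in combinations(vertices, 2):
        score = cosine(event_vectors[a], event_vectors[b])
        if score >= min_relevance:
            edges[(a, b)] = score
    return vertices, edges

# Hypothetical co-occurrence vectors: one slot per article,
# 1 if the article is associated with the event, else 0.
event_vectors = {
    "event_A": [1, 1, 0, 0],
    "event_B": [1, 1, 1, 0],
    "event_C": [0, 0, 0, 1],
}

vertices, edges = build_event_network(event_vectors)
for (a, b), score in edges.items():
    print(f"{a} -- {b}: {score:.3f}")
```

In this toy input only `event_A` and `event_B` share articles, so they are the only pair connected; `event_C` remains an isolated vertex. Ordering the vertices by detection date would give a chronological layout like the one the abstract describes.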
Keywords
event detection; latent Dirichlet allocation (LDA); natural language processing (NLP); text mining; topic modeling