http://dx.doi.org/10.13088/jiis.2013.19.3.001

A Proposal of a Keyword Extraction System for Detecting Social Issues  

Jeong, Dami (Graduate School of Convergence Science and Technology, Seoul National University)
Kim, Jaeseok (Graduate School of Convergence Science and Technology, Seoul National University)
Kim, Gi-Nam (Department of Digital Media, Ajou University)
Heo, Jong-Uk (Department of Web Science, Korea Advanced Institute of Science and Technology)
On, Byung-Won (Advanced Institutes of Convergence Technology, Seoul National University)
Kang, Mijung (Advanced Institutes of Convergence Technology, Seoul National University)
Publication Information
Journal of Intelligence and Information Systems / v.19, no.3, 2013, pp. 1-23
Abstract
To discover significant social issues that modern society urgently needs to solve, such as unemployment, economic crisis, and social welfare, researchers have usually collected opinions from professional experts and scholars through online or offline surveys. This approach is not always effective. Because surveys are expensive, a large number of responses is seldom gathered, and in some cases it is hard to find professionals who deal with a specific social issue. The sample set is therefore often small and may be biased. Furthermore, several experts may reach totally different conclusions about the same social issue, because each expert has a subjective point of view and a different background. It is then considerably hard to figure out what the current social issues are and which of them are really important. To overcome these shortcomings, in this paper we develop a prototype system that semi-automatically detects social issue keywords, which represent social issues and problems, from about 1.3 million news articles published by about 10 major domestic news outlets in Korea from June 2009 until July 2012. Our proposed system consists of (1) collecting news articles and extracting their text, (2) identifying only the news articles related to social issues, (3) analyzing the lexical items of Korean sentences, (4) finding a set of topics regarding social keywords over time based on probabilistic topic modeling, (5) matching relevant paragraphs to a given topic, and (6) visualizing social keywords for easy understanding. In particular, we propose a novel matching algorithm relying on generative models. The goal of the matching algorithm is to find the paragraphs that best match each topic.
Technically, using a topic model such as Latent Dirichlet Allocation (LDA), we can obtain a set of topics, each of which consists of relevant terms and their probability values. In our problem, given a set of text documents (e.g., news articles), LDA produces a set of topic clusters, and each topic cluster is then labeled by human annotators, where each topic label stands for a social keyword. For example, suppose there is a topic (e.g., Topic1 = {(unemployment, 0.4), (layoff, 0.3), (business, 0.3)}) and a human annotator labels Topic1 as "Unemployment Problem". From this label alone, it is non-trivial to understand what actually happened regarding the unemployment problem in our society. In other words, by looking only at social keywords, we have no idea of the detailed events occurring in our society. To tackle this matter, we develop a matching algorithm that computes the probability value of a paragraph given a topic, relying on (i) the topic terms and (ii) their probability values. Given a set of text documents, we segment each document into paragraphs and, in parallel, use LDA to extract a set of topics from the documents. Our matching process then assigns each paragraph to the topic it best matches, so that each topic ends up with several best-matched paragraphs. For instance, suppose that a topic (e.g., Unemployment Problem) has a best-matched paragraph (e.g., Up to 300 workers lost their jobs in XXX company at Seoul). In this case, we can grasp the detailed information behind the social keyword, such as "300 workers", "unemployment", "XXX company", and "Seoul". In addition, our system visualizes social keywords over time. Through the matching process and the keyword visualization, researchers should therefore be able to detect social issues easily and quickly.
Using this prototype system, we detected various social issues appearing in our society, and our experimental results show the effectiveness of the proposed methods. The proof-of-concept system is available at http://dslab.snu.ac.kr/demo.html.
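The matching step described in the abstract, scoring a paragraph against a topic using the topic's terms and their probability values, can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the sketch assumes a simple unigram likelihood: the length-normalized log-probability (1/|d|) Σ_{w in d} log P(w | topic), with a small floor probability for words outside the topic's term list. The floor value, the regex tokenizer (a stand-in for Korean lexical analysis), the "Social Welfare" topic, and all function names are hypothetical; only the "Unemployment Problem" term probabilities come from the Topic1 example.

```python
import math
import re

# Topics in the form a topic model produces: label -> {term: probability}.
# "Unemployment Problem" mirrors the Topic1 example; "Social Welfare" is
# a hypothetical second topic added only for illustration.
TOPICS = {
    "Unemployment Problem": {"unemployment": 0.4, "layoff": 0.3, "business": 0.3},
    "Social Welfare": {"welfare": 0.5, "pension": 0.3, "childcare": 0.2},
}

EPSILON = 1e-6  # assumed floor probability for words not among the topic terms


def tokenize(text):
    """Crude English tokenizer; a stand-in for a Korean lexical analyzer."""
    return re.findall(r"[a-z]+", text.lower())


def paragraph_score(tokens, topic_terms, epsilon=EPSILON):
    """Length-normalized log-likelihood of the tokens under one topic."""
    if not tokens:
        return float("-inf")
    total = sum(math.log(topic_terms.get(tok, epsilon)) for tok in tokens)
    return total / len(tokens)


def best_topic(tokens, topics):
    """Label of the topic under which the tokens are most probable."""
    return max(topics, key=lambda label: paragraph_score(tokens, topics[label]))


def match_paragraphs(paragraphs, topics):
    """Assign each paragraph to its best topic; sort each topic's list by score."""
    matched = {label: [] for label in topics}
    for text in paragraphs:
        tokens = tokenize(text)
        label = best_topic(tokens, topics)
        matched[label].append((paragraph_score(tokens, topics[label]), text))
    for label in matched:
        matched[label].sort(key=lambda pair: pair[0], reverse=True)
    return matched


if __name__ == "__main__":
    sample = [
        "Up to 300 workers faced layoff as unemployment rose in Seoul",
        "The pension and welfare budget was expanded this year",
    ]
    for label, items in match_paragraphs(sample, TOPICS).items():
        for score, text in items:
            print(f"{label}: {text} (score {score:.2f})")
```

In this sketch a paragraph containing none of the topic terms scores the same under every topic and falls to the first topic listed, so a real system would likely need a minimum-score threshold before accepting a match.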
Keywords
Topic Modeling; Generative Model; Matching; Text Mining; Social Issue Keywords; Social Issue Filtering; News Articles; Time Series Keyword Visualization