http://dx.doi.org/10.9728/dcs.2018.19.1.51

WV-BTM: A Technique on Improving Accuracy of Topic Model for Short Texts in SNS  

Song, Ae-Rin (Department of IT Engineering, Sookmyung Women's University)
Park, Young-Ho (Department of IT Engineering, Sookmyung Women's University)
Publication Information
Journal of Digital Contents Society / v.19, no.1, 2018, pp. 51-58
Abstract
As the number of SNS users and the volume of SNS data have grown explosively, research based on SNS big data has become active. In social mining, Latent Dirichlet Allocation (LDA), a typical topic model technique, is used to identify the similarity between texts in large volumes of unclassified SNS text data and to extract trends from them. However, LDA has difficulty inferring high-level topics from short texts, because words occur infrequently in short sentences and the data are therefore semantically sparse. The Biterm Topic Model (BTM) addressed this limitation of LDA by modeling combinations of two words. However, BTM has its own limitation: because it is influenced more by the high-frequency word in each word pair, it cannot compute weights that reflect how the words relate to each topic. In this paper, we propose a technique that improves the accuracy of the existing BTM by reflecting the semantic relations between words.
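The two ideas in the abstract, word pairs as the modeling unit and a weight that reflects how related the two words are, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the semantic relation is computed, so cosine similarity between word vectors is an assumption here, and the function names and the hand-made two-dimensional vectors are hypothetical stand-ins for embeddings trained on a real corpus.

```python
from collections import Counter
from itertools import combinations
import math


def extract_biterms(tokens):
    """Return every unordered pair of distinct words in one short text."""
    return [tuple(sorted(pair))
            for pair in combinations(tokens, 2)
            if pair[0] != pair[1]]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy vectors; a real system would load trained embeddings.
TOY_VECTORS = {
    "coffee": (1.0, 0.0),
    "latte": (0.9, 0.1),
    "traffic": (0.0, 1.0),
}


def weighted_biterms(docs, vectors):
    """Accumulate a similarity-based weight per biterm instead of a raw count.

    Assumption: each occurrence of a biterm contributes the cosine
    similarity of its two words, so related pairs outweigh unrelated ones.
    """
    weights = Counter()
    for tokens in docs:
        for w1, w2 in extract_biterms(tokens):
            weights[(w1, w2)] += cosine(vectors[w1], vectors[w2])
    return weights


docs = [["coffee", "latte", "traffic"], ["coffee", "latte"]]
weights = weighted_biterms(docs, TOY_VECTORS)
for biterm, weight in weights.most_common():
    print(biterm, round(weight, 3))
```

With these toy vectors the pair ("coffee", "latte") receives the largest weight, while pairs involving "traffic" receive little or none, even though all three pairs co-occur in the first text. A plain co-occurrence count would treat the three pairs in that text identically.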
Keywords
Social Network Service; Natural Language Processing; Text Mining; Topic Model; Clustering;