WV-BTM: A Technique on Improving Accuracy of Topic Model for Short Texts in SNS

  • Song, Ae-Rin (Department of IT Engineering, Sookmyung Women's University) ;
  • Park, Young-Ho (Department of IT Engineering, Sookmyung Women's University)
  • Received : 2017.12.18
  • Accepted : 2018.01.29
  • Published : 2018.01.31

Abstract

As the number of SNS users and the volume of SNS data have grown explosively, research based on SNS big data has become active. In social mining, Latent Dirichlet Allocation (LDA), a representative topic model, is used to identify the similarity between texts in large volumes of unclassified SNS text data and to extract trends from them. However, on short texts LDA struggles to infer high-quality topics, because infrequent words make the data semantically sparse. The Biterm Topic Model (BTM) addressed this limitation of LDA by modeling combinations of two words (biterms). However, BTM has its own limitation: because it is influenced more heavily by the high-frequency word in each word pair, it cannot compute weights that take into account how strongly the pair relates to each topic. In this paper, we propose a technique that improves the accuracy of the existing BTM by reflecting the semantic relations between words.
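
To make the idea of a biterm and of a semantic weight concrete, the following is a minimal sketch and not the method of the paper. It assumes that a biterm is any unordered pair of words from one short text, and that the semantic relation between the two words can be approximated by the cosine similarity of their word vectors. The function names, the toy vectors in `TOY_VECTORS`, and the weighting rule itself are hypothetical and chosen only for illustration.

```python
from itertools import combinations
import math


def extract_biterms(tokens):
    """Return every unordered pair of tokens taken from one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 3-dimensional word vectors; real vectors would come from a
# trained embedding model, which this sketch does not include.
TOY_VECTORS = {
    "coffee": [0.9, 0.1, 0.0],
    "latte": [0.8, 0.2, 0.1],
    "today": [0.1, 0.1, 0.9],
}


def weighted_biterms(tokens, vectors):
    """Map each biterm to the cosine similarity of its two words.

    Assumption for illustration: a biterm containing a word without a
    vector gets weight 0.0.
    """
    weights = {}
    for w1, w2 in extract_biterms(tokens):
        if w1 in vectors and w2 in vectors:
            weights[(w1, w2)] = cosine(vectors[w1], vectors[w2])
        else:
            weights[(w1, w2)] = 0.0
    return weights


if __name__ == "__main__":
    tweet = ["coffee", "latte", "today"]
    for biterm, weight in weighted_biterms(tweet, TOY_VECTORS).items():
        print(biterm, round(weight, 3))
```

Under these toy vectors the pair ("coffee", "latte") receives a much larger weight than ("coffee", "today"), which is the kind of signal a frequency-only treatment of biterms would not capture. How such weights would enter the topic inference itself is not specified in the abstract and is left out of this sketch.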
