http://dx.doi.org/10.5391/JKIIS.2014.24.5.482

Semi-supervised learning for sentiment analysis in mass social media  

Hong, Sola (College of Information and Communication Engineering, Sungkyunkwan University)
Chung, Yeounoh (College of Information and Communication Engineering, Sungkyunkwan University)
Lee, Jee-Hyong (College of Information and Communication Engineering, Sungkyunkwan University)
Publication Information
Journal of the Korean Institute of Intelligent Systems / v.24, no.5, 2014, pp. 482-488
Abstract
This paper aims to automatically analyze users' emotions by analyzing Twitter, a representative social network service (SNS). Building sentiment analysis models with machine learning techniques requires sentiment labels that indicate positive or negative emotion. However, obtaining sentiment labels for tweets is very expensive. We therefore propose a sentiment analysis model that uses self-training to exploit "data without sentiment labels" as well as "data with sentiment labels". In self-training, the labels of "data without sentiment labels" are determined using a model trained on "data with sentiment labels", and the model is then updated using both the "data with sentiment labels" and the newly labeled data. This technique gradually improves sentiment analysis performance. However, misclassifications of unlabeled data at an early stage affect model updates throughout the whole learning process, because the labels of unlabeled data never change once they are determined. Thus, the labels of "data without sentiment labels" need to be determined carefully. To achieve high performance with self-training, we propose three policies for updating the "data with sentiment labels" and conduct a comparative analysis. The first policy selects, among the newly labeled data, only the data whose confidence is higher than a given threshold. The second policy chooses the same number of positive and negative data from the newly labeled data, to avoid the imbalanced class learning problem. The third policy chooses fewer newly labeled data than a given maximum number, so that large amounts of data are not added at once and the model is updated gradually. Experiments are conducted on the Stanford data set, whose tweets are classified as positive or negative. As a result, the learned model achieves higher performance than both the model learned using only "data with sentiment labels" and self-training with a regular model update policy.
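The self-training loop and the three update policies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `select_updates`, `self_train`, and `train_midpoint` are hypothetical, and the three policies are combined in one function only for brevity, whereas the paper compares them. The size cap is treated as "at most `max_new`". A toy one-dimensional midpoint classifier stands in for the SVM, with its signed score playing the role of a decision value whose magnitude serves as the confidence.

```python
from typing import Callable, List, Sequence, Tuple

# A candidate is (index in the unlabeled pool, predicted label, confidence);
# labels are +1 (positive) or -1 (negative).
Candidate = Tuple[int, int, float]


def select_updates(candidates: Sequence[Candidate],
                   threshold: float,
                   max_new: int) -> List[Candidate]:
    """Pick newly labeled items to add to the labeled set (assumed combination)."""
    # Policy 1: keep only predictions whose confidence exceeds the threshold.
    confident = [c for c in candidates if c[2] > threshold]
    # Rank each class by confidence, most confident first.
    pos = sorted((c for c in confident if c[1] == 1), key=lambda c: -c[2])
    neg = sorted((c for c in confident if c[1] == -1), key=lambda c: -c[2])
    # Policy 2: take the same number of positive and negative items.
    # Policy 3: cap the number added in one round (at most max_new in total).
    per_class = min(len(pos), len(neg), max_new // 2)
    return pos[:per_class] + neg[:per_class]


def self_train(labeled: Sequence[Tuple[float, int]],
               unlabeled: Sequence[float],
               train: Callable[[Sequence[Tuple[float, int]]], Callable[[float], float]],
               threshold: float,
               max_new: int,
               rounds: int) -> List[Tuple[float, int]]:
    """Grow the labeled set by repeatedly labeling the unlabeled pool."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        score = train(labeled)  # retrain on everything labeled so far
        candidates = []
        for i, x in enumerate(pool):
            s = score(x)
            candidates.append((i, 1 if s >= 0 else -1, abs(s)))
        chosen = select_updates(candidates, threshold, max_new)
        if not chosen:
            break  # nothing passes the policies; stop early
        # Once assigned, a pseudo-label is never revised in later rounds.
        labeled.extend((pool[i], y) for i, y, _ in chosen)
        picked = {i for i, _, _ in chosen}
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return labeled


def train_midpoint(labeled: Sequence[Tuple[float, int]]) -> Callable[[float], float]:
    """Toy stand-in for an SVM: score is the signed distance to the class midpoint."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == -1]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x - mid
```

For example, `self_train([(2.0, 1), (-2.0, -1)], [3.0, -3.0, 0.1, 1.5, -1.5], train_midpoint, threshold=1.0, max_new=2, rounds=5)` adds one positive and one negative item per round and never labels the ambiguous point `0.1`, because its confidence stays below the threshold.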
Keywords
Twitter; Sentiment analysis; Semi-supervised learning; Self-training; SVM
Citations & Related Records
Times Cited By KSCI: 4