http://dx.doi.org/10.3745/KTSDE.2017.6.9.445

A Method for Twitter Spam Detection Using N-Gram Dictionary Under Limited Labeling  

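Choi, Hyeok-Jun (Department of Computer Engineering, Chungnam National University)
Park, Cheong Hee (Department of Computer Engineering, Chungnam National University)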
Publication Information
KIPS Transactions on Software and Data Engineering / v.6, no.9, 2017, pp. 445-456
Abstract
In this paper, we propose a method for detecting spam tweets that contain unhealthy information by using an n-gram dictionary under limited labeling. Spam tweets that contain unhealthy information tend to use similar words and sentences. Based on this characteristic, we show that spam tweets can be detected effectively by applying a Naive Bayesian classifier that uses n-gram dictionaries constructed from spam tweets and normal tweets. However, constructing an initial training set is very costly because a large amount of data flows through Twitter in real time. A spam detection method is therefore needed that can be applied in an environment where the initial training set is very small or does not exist. To solve this problem, we propose a method that generates pseudo-labels by utilizing Twitter's retweet function and uses them to build the initial training set and to update the n-gram dictionary. Results from various experiments on 1.3 million Korean tweets collected from December 1, 2016 to December 7, 2016 show that the proposed method outperforms the compared spam detection methods.
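The classification idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of character bigrams, Laplace smoothing with `alpha`, the class name `NgramNaiveBayes`, and the toy English tweets are all assumptions made for illustration. The rule that derives pseudo-labels from retweets is not specified in the abstract, so the sketch only shows how a tweet that already carries a (pseudo-)label would update the dictionaries.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the character n-grams of a tweet (runs of whitespace collapsed)."""
    s = " ".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over two n-gram dictionaries (hypothetical sketch)."""

    LABELS = ("spam", "normal")

    def __init__(self, n=2, alpha=1.0):
        self.n = n
        self.alpha = alpha  # Laplace smoothing constant (an assumption)
        self.dicts = {label: Counter() for label in self.LABELS}
        self.docs = {label: 0 for label in self.LABELS}

    def update(self, text, label):
        """Add one labeled or pseudo-labeled tweet to its class dictionary."""
        self.dicts[label].update(char_ngrams(text, self.n))
        self.docs[label] += 1

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the tweet's n-grams."""
        counts = self.dicts[label]
        total = sum(counts.values())
        vocab = len(set(self.dicts["spam"]) | set(self.dicts["normal"]))
        score = math.log((self.docs[label] + self.alpha) /
                         (sum(self.docs.values()) + 2 * self.alpha))
        for gram in char_ngrams(text, self.n):
            # The extra +1 in the denominator reserves mass for unseen n-grams.
            score += math.log((counts[gram] + self.alpha) /
                              (total + self.alpha * (vocab + 1)))
        return score

    def predict(self, text):
        """Return the class with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Toy data, invented for illustration only.
clf = NgramNaiveBayes(n=2)
clf.update("free coupon click this link now", "spam")
clf.update("free gift click link to win", "spam")
clf.update("had lunch with friends today", "normal")
clf.update("the weather is nice this morning", "normal")
print(clf.predict("click now for a free coupon"))
```

Because `update` only increments counts, the same call can serve both for building the initial dictionaries and for updating them later as pseudo-labeled tweets arrive, which is the role the abstract assigns to pseudo-labels.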
Keywords
Twitter; Spam; Retweet; N-Gram; Pseudo-Label