http://dx.doi.org/10.7840/kics.2017.42.2.493

Detecting Spam Data for Securing the Reliability of Text Analysis  

Hyun, Yoonjin (Kookmin University The Graduate School of Business Information Technology)
Kim, Namgyu (Kookmin University School of MIS)
Abstract
Recently, the tremendous amount of unstructured text data distributed through news, blogs, and social media has attracted much attention from researchers and practitioners, because this data contains abundant information about consumers' opinions. However, as text data becomes more useful, attempts to profit by distorting it, whether maliciously or non-maliciously, are also increasing. The resulting growth in spam text not only burdens users seeking useful information with a large amount of inappropriate information, but also damages the reliability of the information and of its providers. Therefore, efforts must be made to improve the reliability of information and the quality of analysis results by detecting and removing spam data in advance. To this end, spam detection has been actively studied in areas such as opinion spam detection, spam e-mail detection, and web spam detection. In this study, we introduce the core concepts and current research trends of spam detection and propose a methodology for detecting spam tags in blogs, as one challenging attempt to improve the reliability of blog information.
Keywords
Text Analysis; Text Mining; Topic Modeling; Spam Detection