http://dx.doi.org/10.3745/KIPSTB.2011.18B.5.315

A Study on Spam Document Classification Method using Characteristics of Keyword Repetition  

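Lee, Seong-Jin (Department of Computer Science, Soongsil University)
Baik, Jong-Bum (Department of Computer Science, Soongsil University)
Han, Chung-Seok (Department of Computer Science, Soongsil University)
Lee, Soo-Won (School of Computing, Soongsil University)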
Abstract
In the Web environment, the flood of spam causes serious social problems such as leakage of personal information, monetary loss from phishing, and distribution of harmful content. Moreover, the types of spam and the distribution techniques that must be controlled change from day to day. Learning-based spam classification using the Bag-of-Words model has been the most widely used method to date. However, this method is vulnerable to the anti-spam evasion techniques common in recent spam, because it classifies documents using only the keyword occurrence information learned during model training. In this paper, we propose a spam document detection method that counters these evasion techniques by exploiting the word repetition characteristic of spam documents. Most recent spam documents tend to repeat the key phrases they are designed to spread, and this tendency can serve as a measure for classifying spam documents. We define six variables that represent characteristics of word repetition and use them as the feature set for constructing a classification model. The effectiveness of the proposed method is evaluated in experiments on blog posts and e-mail data. The experimental results show that the proposed method outperforms other approaches.
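The abstract does not list the six repetition variables, so the following is only a minimal sketch of what repetition-based features could look like. The function name `repetition_features` and all four statistics are hypothetical illustrations, not the variables defined in the paper.

```python
import re
from collections import Counter


def repetition_features(text, n=2):
    """Return illustrative word-repetition statistics for one document.

    Assumption: these four features are hypothetical stand-ins and are
    NOT the six variables defined in the paper.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {"type_token_ratio": 0.0, "max_term_share": 0.0,
                "repeated_token_share": 0.0, "max_ngram_count": 0}

    counts = Counter(tokens)
    # Word n-grams capture repeated key phrases, not only repeated single words.
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))

    return {
        # Distinct words / all words: low values indicate heavy repetition.
        "type_token_ratio": len(counts) / len(tokens),
        # Share of the document taken up by its single most frequent word.
        "max_term_share": max(counts.values()) / len(tokens),
        # Share of tokens belonging to words that occur more than once.
        "repeated_token_share":
            sum(c for c in counts.values() if c > 1) / len(tokens),
        # Number of occurrences of the most frequent word n-gram.
        "max_ngram_count": max(ngrams.values(), default=0),
    }
```

A feature dictionary like this could be computed per document and passed to any standard classifier in place of, or alongside, Bag-of-Words counts. Because the features depend on how often words recur rather than on which words appear, they are less sensitive to keyword substitution.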
Keywords
Spam Filtering; Spam; Spamdexing; Term Spamming; Word Repetition