http://dx.doi.org/10.13089/JKIISC.2008.18.6A.129

An Approach to Detect Spam E-mail with Abnormal Character Composition  

Lee, Ho-Sub (Korea University)
Cho, Jae-Ik (Korea University)
Jung, Man-Hyun (Korea University)
Moon, Jong-Sub (Korea University)
Abstract
As internet use has grown, the volume of spam mail has grown sharply with it. E-mail was originally used mainly to exchange information, but it is now increasingly used for advertising and malware distribution. This is a serious problem because spam consumes a large share of limited internet resources, and preventing it consumes substantial computing, network, and human resources as well. As a result, much research is devoted to preventing and filtering spam. Recent work addresses spam written in sentences that remain readable but do not follow proper grammar. This type of spam cannot be classified by earlier vocabulary analysis or document classification methods. This paper proposes a method that filters such spam using the subject line of the mail, indexing it with N-grams and classifying it with Bayesian and SVM algorithms.
Keywords
Spam mail filtering; N-gram; Support Vector Machine; Bayesian Decision Theory
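
The pipeline named in the abstract (subject line, N-gram indexing, Bayesian classification) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the trigram size, the add-one smoothing, the class name NaiveBayesSubjectFilter, and the toy subject lines. A multinomial naive Bayes model stands in for the Bayesian classifier, and the SVM variant is omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


class NaiveBayesSubjectFilter:
    """Hypothetical multinomial naive Bayes over character n-grams."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n            # n-gram length (assumed, not from the paper)
        self.alpha = alpha    # additive (Laplace) smoothing constant
        self.gram_counts = {}         # label -> Counter of n-grams
        self.doc_counts = Counter()   # label -> number of training subjects
        self.vocab = set()

    def fit(self, subjects, labels):
        for text, label in zip(subjects, labels):
            grams = char_ngrams(text, self.n)
            self.gram_counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_score(self, text, label):
        """Log prior plus smoothed log likelihood of the known n-grams."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.gram_counts[label]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        for gram in char_ngrams(text, self.n):
            if gram in self.vocab:  # n-grams never seen in training are skipped
                score += math.log((counts[gram] + self.alpha) / denom)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda lab: self.log_score(text, lab))


# Toy training data, invented for this sketch.
subjects = [
    "V1AGRA ch3ap pr1ces", "fr33 m0ney n0w",
    "ch3ap l0ans 4 u", "w1n fr33 pr1zes",
    "meeting agenda for monday", "project status report",
    "lunch on friday", "quarterly budget review",
]
labels = ["spam"] * 4 + ["ham"] * 4

model = NaiveBayesSubjectFilter(n=3).fit(subjects, labels)
print(model.predict("fr33 ch3ap pr1zes"))
print(model.predict("monday project meeting"))
```

Character n-grams are used rather than whole words because obfuscated tokens such as "ch3ap" never match a dictionary entry, while their fragments ("ch3", "3ap") recur across spam subjects and give the classifier something to count.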