http://dx.doi.org/10.3745/KIPSTB.2008.15-B.1.61

A Spam Filter System Based on Maximum Entropy Model Using Co-training with Spamminess Features and URL Features  

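Gong, Mi-Gyoung (Department of Computer Engineering, Chonbuk National University)
Lee, Kyung-Soon (Division of Electronics and Information Engineering / Center for Advanced Image and Information Technology, Chonbuk National University)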
Abstract
This paper presents a spam filter system based on the maximum entropy model that uses co-training with spamminess features and URL features. Spamminess features are the emphasis patterns or abnormal patterns that spammers use in spam messages to express their intent and to evade spam filters. Because spammers use URLs to provide details and alter the URL format to avoid blacklist filtering, normal and abnormal URLs can be key features for detecting spam messages. Co-training uses these two feature sets, which are independent of each other, as separate views during training, so the filter system can learn from each of them independently. Experimental results on the TREC spam test collection show that the proposed approach improves accuracy by 9.1% over the base system and by 6.9% over the Bogofilter system. Analysis of the results shows that the proposed spamminess features and URL features are helpful. A co-training experiment further shows that the two feature sets are useful: the number of training documents is reduced while accuracy remains close to that of batch learning.
Keywords
Spamminess Feature; URL Feature; Spam Filter; Maximum Entropy Model; Co-Training;
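The co-training scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the binary feature vectors, the toy messages, and the helper names (`train_maxent`, `predict_proba`, `co_train`, `classify`) are all assumptions made for illustration, and the maximum entropy model is reduced to its two-class form (logistic regression) fitted by plain gradient ascent rather than iterative scaling. Each view trains its own classifier, and in each round each classifier labels the unlabeled message it is most confident about and adds it to the shared training set.

```python
import math

def predict_proba(w, x):
    """P(spam | x) under a binary maximum entropy (logistic) model."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))  # w[-1] is the bias
    z = max(-30.0, min(30.0, z))                      # guard against overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(X, y, epochs=200, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = label - predict_proba(w, x)
            for j, xj in enumerate(x):
                w[j] += lr * err * xj
            w[-1] += lr * err
    return w

def co_train(labeled, unlabeled, rounds=5):
    """labeled: (view_a, view_b, label) triples; unlabeled: (view_a, view_b) pairs."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        ys = [y for _, _, y in labeled]
        w_a = train_maxent([a for a, _, _ in labeled], ys)
        w_b = train_maxent([b for _, b, _ in labeled], ys)
        # Each view labels the pooled message it is most confident about.
        for w, view in ((w_a, 0), (w_b, 1)):
            if not pool:
                break
            best = max(pool, key=lambda ex: abs(predict_proba(w, ex[view]) - 0.5))
            pool.remove(best)
            label = 1 if predict_proba(w, best[view]) >= 0.5 else 0
            labeled.append((best[0], best[1], label))
    ys = [y for _, _, y in labeled]
    w_a = train_maxent([a for a, _, _ in labeled], ys)
    w_b = train_maxent([b for _, b, _ in labeled], ys)
    return w_a, w_b, labeled

def classify(w_a, w_b, a, b):
    """Average the two views' spam probabilities; 1 = spam, 0 = ham."""
    p = (predict_proba(w_a, a) + predict_proba(w_b, b)) / 2.0
    return 1 if p >= 0.5 else 0

# Hypothetical toy data. View A: [obfuscated word, heavy emphasis];
# view B: [contains URL, URL looks abnormal]. Label 1 = spam, 0 = ham.
seed = [([1, 1], [1, 1], 1), ([0, 0], [0, 0], 0),
        ([1, 0], [1, 1], 1), ([0, 0], [1, 0], 0)]
unlabeled = [([1, 1], [1, 1]), ([0, 0], [1, 0]),
             ([0, 1], [1, 1]), ([0, 0], [0, 0])]

w_a, w_b, grown = co_train(seed, unlabeled)
print(len(grown), classify(w_a, w_b, [1, 1], [1, 1]), classify(w_a, w_b, [0, 0], [0, 0]))
```

In this sketch the labeled set grows from four seed messages to eight without any additional manual labels, which is the sense in which co-training reduces the number of labeled training documents needed.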