A Spam Mail Classification Using Link Structure Analysis  

Rhee, Shin-Young (Simulation Research Institute Co., Ltd.)
Khil, A-Ra (School of Computing, Soongsil University)
Kim, Myung-Won (School of Computing, Soongsil University)
Abstract
Existing content-based spam mail filtering algorithms have difficulty filtering spam when an e-mail contains images but little text. In this paper we propose an efficient spam mail classification algorithm that uses the link structure of e-mails. We compute the number of hyperlinks in an e-mail and the in-link frequencies of the web pages it links to. Using these two features, we classify e-mails as spam or legitimate with a decision tree trained for spam mail classification. We also propose a hybrid system that combines three algorithms by majority voting: the link structure analysis algorithm; a modified link structure analysis algorithm, in which only the host part of each hyperlinked page's URL is used for link structure analysis; and a content-based method using support vector machines (SVM). Experimental results show that the link structure analysis algorithm, with an accuracy of 94.8%, slightly outperforms the existing content-based method. Moreover, the hybrid system achieves an accuracy of 97.0%, a significant improvement over the existing method.
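The two link features and the majority vote described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `LinkCollector`, `link_features`, and `majority_vote` are hypothetical, `inlink_counts` stands in for whatever source of in-link frequencies the paper used, and averaging the in-link frequencies over the linked pages is an assumption, since the abstract does not say how they are aggregated. The trained decision tree and the SVM are omitted.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags in an HTML e-mail body."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def link_features(html_body, inlink_counts, host_only=False):
    """Return (number of hyperlinks, mean in-link frequency).

    inlink_counts is a hypothetical lookup table mapping a URL (or a
    host name when host_only=True) to its in-link frequency.
    Taking the mean is an assumption made for this sketch.
    """
    parser = LinkCollector()
    parser.feed(html_body)
    # The modified variant looks up only the host part of each URL.
    keys = [urlparse(u).netloc if host_only else u for u in parser.links]
    counts = [inlink_counts.get(k, 0) for k in keys]
    mean_inlinks = sum(counts) / len(counts) if counts else 0.0
    return len(parser.links), mean_inlinks


def majority_vote(labels):
    """Return the label predicted by most of the classifiers."""
    return Counter(labels).most_common(1)[0][0]
```

In this sketch, each of the three classifiers would emit a label such as "spam" or "legitimate" for an e-mail, and `majority_vote` would pick the label at least two of them agree on.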
Keywords
Spam Mail Classification; Link Structure Analysis; Decision Tree; SVM