[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5391/IJFIS.2006.6.2.105

Analyzing the Effect of Lexical and Conceptual Information in Spam-mail Filtering System

Kang Sin-Jae (School of Computer and Information Technology, Daegu University)
Kim Jong-Wan (School of Computer and Information Technology, Daegu University)

Publication Information

International Journal of Fuzzy Logic and Intelligent Systems / v.6, no.2, 2006, pp. 105-109

Abstract

In this paper, we constructed a two-phase spam-mail filtering system based on lexical and conceptual information. Two kinds of information can distinguish spam mail from ham (non-spam) mail. The definite information consists of the mail sender's information, URLs, and a list of known spam keywords; the less definite information consists of the word list and concept codes extracted from the mail body. In the first phase, we classified mail using the definite information. In the second phase, we used the less definite information, taking the lexical information and concept codes contained in the email body as features for SVM learning. According to our results, the ham misclassification rate was reduced when more lexical information was used as features, and the spam misclassification rate was reduced when concept codes were included as features as well.
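The two-phase flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sender, URL, and keyword lists, the word-to-concept-code table, and all function names are hypothetical, and a hand-weighted linear score stands in for the trained SVM used in the second phase.

```python
import re
from typing import Callable, Dict, Set

# All lists below are hypothetical illustrations, not data from the paper.
BLOCKED_SENDERS: Set[str] = {"promo@bulk-offers.example"}
BLOCKED_URL_HOSTS: Set[str] = {"cheap-pills.example"}
SPAM_KEYWORDS: Set[str] = {"viagra", "lottery"}

# Hypothetical thesaurus fragment: word -> concept code.
CONCEPT_CODES: Dict[str, str] = {
    "loan": "C_FINANCE",
    "credit": "C_FINANCE",
    "meeting": "C_WORK",
    "report": "C_WORK",
}

URL_HOST = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)
WORD = re.compile(r"[a-z]+")


def phase1_is_spam(sender: str, body: str) -> bool:
    """Phase 1: decide from definite information (sender, URLs, keywords)."""
    if sender.lower() in BLOCKED_SENDERS:
        return True
    hosts = {host.lower() for host in URL_HOST.findall(body)}
    if hosts & BLOCKED_URL_HOSTS:
        return True
    words = set(WORD.findall(body.lower()))
    return bool(words & SPAM_KEYWORDS)


def extract_features(body: str, use_concepts: bool = True) -> Dict[str, int]:
    """Phase 2 features: word counts, optionally plus concept-code counts."""
    features: Dict[str, int] = {}
    for word in WORD.findall(body.lower()):
        features["w:" + word] = features.get("w:" + word, 0) + 1
        code = CONCEPT_CODES.get(word)
        if use_concepts and code is not None:
            features["c:" + code] = features.get("c:" + code, 0) + 1
    return features


def linear_score(features: Dict[str, int], weights: Dict[str, float],
                 bias: float = 0.0) -> float:
    """Stand-in for a trained SVM decision function (assumed weights)."""
    return bias + sum(weights.get(name, 0.0) * count
                      for name, count in features.items())


def classify(sender: str, body: str,
             score: Callable[[Dict[str, int]], float]) -> str:
    """Run phase 1 first; fall through to the phase 2 score otherwise."""
    if phase1_is_spam(sender, body):
        return "spam"
    return "spam" if score(extract_features(body)) > 0 else "ham"
```

In practice the `score` argument would wrap the decision function of a classifier trained on labeled mail; passing `use_concepts=False` to `extract_features` gives the lexical-only feature set for comparison.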

Keywords

Information filtering; spam-mail filtering; thesaurus; concept information
