Comparing Feature Selection Methods in Spam Mail Filtering

  • Kim, Jong-Wan (School of Computer and Information Technology, Daegu University)
  • Kang, Sin-Jae (School of Computer and Information Technology, Daegu University)
  • Published : 2005.11.25

Abstract

In this work, we compare several feature selection methods for spam mail filtering. The proposed fuzzy inference method outperforms the information gain and chi-squared test methods as a feature selection method in terms of error rate. Because the body of a junk mail often contains little text, it provides insufficient hints for distinguishing spam from legitimate mail. To address this problem, we follow the hyperlinks contained in the email body, fetch the contents of the remote web pages, and extract hints from both the original email body and the fetched web pages. A two-phase approach is applied to filter spam mails: definite hints are used first, and less definite textual information is used second. In our experiments, the proposed two-phase method improved recall by 32.4% on average over using the $1^{st}$ phase or the $2^{nd}$ phase alone.
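
For readers unfamiliar with the two baseline methods named above, the following is a minimal sketch of how a single candidate term can be scored with the chi-squared statistic and with information gain, using the textbook formulas on a 2x2 term/class count table. The function names and the counts `a`, `b`, `c`, `d` are illustrative assumptions, not taken from the paper, and the proposed fuzzy inference method is not shown.

```python
import math


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/class table (hypothetical helper).

    a: spam mails containing the term
    b: legitimate mails containing the term
    c: spam mails without the term
    d: legitimate mails without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the term carries no usable signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Reduction in class entropy from knowing whether the term is present."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * entropy([a, b]) + ((c + d) / n) * entropy([c, d])
    return h_class - h_given_term


if __name__ == "__main__":
    # Illustrative counts only: a term seen in 40 of 50 spam mails
    # and in 5 of 50 legitimate mails.
    print(chi_squared(40, 5, 10, 45))
    print(information_gain(40, 5, 10, 45))
```

Under either score, terms would be ranked and the highest-scoring ones kept as features; the cutoff is a tuning choice that this sketch leaves open.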

Keywords