An Automatic Spam e-mail Filter System Using χ² Statistics and Support Vector Machines

Lee, Songwook

Proceedings of the Korean Institute of Information and Communication Sciences Conference (한국정보통신학회:학술대회논문집)

2009.05a / Pages 592-595 / 2009

The Korea Institute of Information and Communication Engineering (한국정보통신학회)

Lee, Songwook (Chungju National University)

Published : 2009.05.29

Abstract

We propose an automatic spam classifier for e-mail using Support Vector Machines (SVM). As features, we use the lexical form of each word together with its part-of-speech (POS) tag. We select useful features with $\chi^2$ statistics and represent each selected feature by its term frequency (TF) and inverse document frequency (IDF) values. After the SVM is trained on these features, it classifies each e-mail as spam or not spam. In our experiments, we obtained an accuracy of about 82.7% on e-mail data collected from a webmail system.
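The feature selection and weighting steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the following are assumptions: the standard 2x2 contingency-table form of the $\chi^2$ statistic, an unsmoothed IDF of `log(N / df)`, and `word/POS` strings as tokens. The function names (`chi_square`, `select_features`, `tfidf_vectors`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: spam documents containing the term
    n10: non-spam documents containing the term
    n01: spam documents without the term
    n00: non-spam documents without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # term or class is degenerate, so it carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k tokens with the highest chi-square score (ties broken alphabetically)."""
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    df_spam, df_ham = Counter(), Counter()
    for tokens, is_spam in zip(docs, labels):
        (df_spam if is_spam else df_ham).update(set(tokens))
    vocab = set(df_spam) | set(df_ham)
    scores = {
        t: chi_square(df_spam[t], df_ham[t], n_spam - df_spam[t], n_ham - df_ham[t])
        for t in vocab
    }
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


def tfidf_vectors(docs, features):
    """Represent each document as a list of TF * IDF values over the selected features."""
    n_docs = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    # Assumed IDF variant: log(N / df); features come from docs, so df >= 1.
    idf = {t: math.log(n_docs / df[t]) for t in features}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append([tf[t] * idf[t] for t in features])
    return vectors


# Toy data (made up for illustration): tokens are "word/POS" strings, label 1 = spam.
docs = [
    ["free/JJ", "money/NN", "now/RB"],
    ["free/JJ", "offer/NN", "now/RB"],
    ["meeting/NN", "schedule/NN", "now/RB"],
    ["project/NN", "meeting/NN", "notes/NNS"],
]
labels = [1, 1, 0, 0]

features = select_features(docs, labels, k=2)
vectors = tfidf_vectors(docs, features)
print(features)
print(vectors)
```

The resulting vectors would then be passed, together with the labels, to an SVM trainer. That step is omitted here because the abstract does not state which SVM implementation or kernel was used.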
