• Title/Summary/Keyword: Spam Filtering

Analysis of filtering performance of Korean and English spam-mails (한국어와 영어 스팸메일의 필터링 성능 분석)

  • Hwang Wun-Ho;Kang Sin-Jae;Kim Tae-Hee;Kim Hee-Jae;Kim Jong-Wan
    • Proceedings of the Korea Society for Industrial Systems Conference / 2006.05a / pp.389-396 / 2006
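  • In this study, we build a two-stage spam mail filtering system for Korean and English mail and evaluate its performance. The system consists of a first stage that uses a blacklist and a second stage that performs intelligent classification through machine learning. If a newly arrived mail contains an entry from the blacklist, it is classified as spam; otherwise it passes to the second stage, which decides whether it is spam. To separate English spam mail (mail whose body is written in English) from legitimate mail, stemming and stopping (stop-word removal) are first applied to extract normalized lexical information from the body. Feature vectors are built from the extracted lexical information and used to train an SVM, and the resulting SVM classifier performs the intelligent spam filtering. The performance of the filtering system depends on how the features that define the feature vectors are selected. We therefore apply several feature selection algorithms when building the feature vectors for SVM training and compare the resulting performance. We also compare the overall performance of the English spam mail filtering system with that of the Korean spam mail filtering system.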
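
The two-stage flow described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the blacklist entries, the stop-word list, the `crude_stem` suffix stripper, and the `second_stage` callable are all hypothetical stand-ins, and the trained SVM classifier from the paper is replaced here by an arbitrary callable that takes a feature vector and returns True for spam.

```python
import re
from typing import Callable, Dict, List

# Hypothetical blacklist and stop-word list, for illustration only.
BLACKLIST = {"spammer@example.com", "free-pills.example"}
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "for", "you"}


def crude_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(body: str) -> List[str]:
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", body.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def to_feature_vector(terms: List[str], vocabulary: List[str]) -> List[int]:
    """Term counts in the fixed order given by the selected vocabulary."""
    counts: Dict[str, int] = {}
    for term in terms:
        counts[term] = counts.get(term, 0) + 1
    return [counts.get(v, 0) for v in vocabulary]


def classify(sender: str, body: str, vocabulary: List[str],
             second_stage: Callable[[List[int]], bool]) -> str:
    # Stage 1: any blacklist entry found in the mail means spam.
    text = (sender + " " + body).lower()
    if any(entry in text for entry in BLACKLIST):
        return "spam"
    # Stage 2: hand the feature vector to the learned classifier
    # (an SVM in the paper; any callable in this sketch).
    vector = to_feature_vector(extract_terms(body), vocabulary)
    return "spam" if second_stage(vector) else "ham"
```

The `vocabulary` argument is where feature selection enters: whichever terms a selection algorithm keeps become the dimensions of the vector passed to the second stage.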

An Implementation and Evaluation of FQDN Check System to Filter Junk Mail (정크메일 차단을 위한 FQDN 확인 시스템의 구현 및 평가)

  • Kim Sung-Chan;Lee Sang-Hun;Jun Moon-Seog
    • The KIPS Transactions:PartC / v.12C no.3 s.99 / pp.361-368 / 2005
  • Internet mail has become a common communication method around the world because of the tremendous growth in Internet service usage. At the same time, most Internet users' mail addresses are exposed to spammers, and the damage caused by junk mail keeps growing. The junk mail problem is becoming more serious because junk mail is now used as an attack vector and as a propagation scheme for malicious code, which makes it one of the most dangerous causes of computer system incidents. This paper presents a junk mail filtering model and implementation based on an FQDN (Fully Qualified Domain Name) check, and evaluates it in order to propose an improved scheme against junk mail.
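
The abstract does not say exactly how the FQDN check is performed, so the following is only a minimal sketch of one plausible form of it: a syntactic test that a host name announced by a sending server looks like a fully qualified domain name, with an optional DNS lookup. The function names and the accept/reject policy are assumptions made for illustration and are not taken from the paper.

```python
import re
import socket

# One DNS label: 1-63 letters, digits, or hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_syntactic_fqdn(name: str) -> bool:
    """Return True if `name` has the shape of a fully qualified domain name."""
    if name.endswith("."):          # a single trailing root dot is allowed
        name = name[:-1]
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    if len(labels) < 2:             # bare host names such as "localhost"
        return False
    if labels[-1].isdigit():        # rejects IPv4 literals such as 192.168.0.1
        return False
    return all(_LABEL.match(label) for label in labels)


def resolves(name: str) -> bool:
    """Optional second check: does the name resolve in DNS? (needs network)"""
    try:
        socket.getaddrinfo(name, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def accept_sender(host_name: str, check_dns: bool = False) -> bool:
    """Hypothetical policy: reject mail whose announced host is not an FQDN."""
    if not is_syntactic_fqdn(host_name):
        return False
    return resolves(host_name) if check_dns else True
```

For example, `accept_sender("mail.example.com")` passes the syntactic test, while `accept_sender("localhost")` and `accept_sender("192.168.0.1")` are rejected.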

A Classification Model for Predicting the Injured Body Part in Construction Accidents in Korea

  • Lim, Jiseon;Cho, Sungjin;Kang, Sanghyeok
    • International conference on construction engineering and project management / 2022.06a / pp.230-237 / 2022
  • Industrial accidents in the construction industry are difficult to predict because many factors, both human-related and environment-related, contribute to them. Many studies have analyzed injury severity and accident types; however, few have addressed the prediction of injured body parts. This study aims to develop a classification model that predicts the injured body part from accident-related factors. Construction accident cases from June 2018 to July 2021, provided by the Korea Construction Safety Management Integrated Information, were collected through web crawling and then preprocessed. A naïve Bayes classifier, a supervised learning algorithm, was employed to construct a classification model of the injured body part with four categories: 1) torso, 2) upper extremity, 3) head, and 4) lower extremity. The predictor variables are accident type, type of work, facility type, injury source, and activity type. The average accuracy over the injured body parts was 50.4%. Accuracy for the upper and lower extremities was relatively higher than for the torso and head. Unlike in other classification tasks such as spam mail filtering, the naïve Bayes classifier did not perform well on construction accidents. The reasons are discussed in the study. Based on these results, more detailed guidelines for construction safety management can be provided, which can help establish safety measures at construction sites.
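
A naïve Bayes classifier over categorical predictors, of the general kind this abstract describes, can be sketched as below. The class name, the Laplace smoothing parameter `alpha`, and any data passed to it are assumptions for illustration; the paper's actual preprocessing and model settings are not given in the abstract.

```python
import math
from collections import Counter, defaultdict


class CategoricalNB:
    """Naive Bayes for categorical features with Laplace smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # smoothing strength (an assumption, not from the paper)

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.n_features = len(rows[0])
        # (feature index, class label) -> counts of each observed value
        self.value_counts = defaultdict(Counter)
        # distinct values seen for each feature, used in the smoothing denominator
        self.values = [set() for _ in range(self.n_features)]
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.value_counts[(j, y)][v] += 1
                self.values[j].add(v)
        return self

    def log_posteriors(self, row):
        """Unnormalized log P(class) + sum of log P(feature value | class)."""
        total = sum(self.class_counts.values())
        scores = {}
        for y, n_y in self.class_counts.items():
            score = math.log(n_y / total)
            for j, v in enumerate(row):
                count = self.value_counts[(j, y)][v]
                k = len(self.values[j])
                score += math.log((count + self.alpha) / (n_y + self.alpha * k))
            scores[y] = score
        return scores

    def predict(self, row):
        scores = self.log_posteriors(row)
        return max(scores, key=scores.get)
```

Each row would hold one value per predictor (for instance accident type and type of work), and each label one of the four body-part categories. The per-feature product in `log_posteriors` is the conditional independence assumption that gives the method its name.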

Categorical Variable Selection in Naïve Bayes Classification (단순 베이즈 분류에서의 범주형 변수의 선택)

  • Kim, Min-Sun;Choi, Hosik;Park, Changyi
    • The Korean Journal of Applied Statistics / v.28 no.3 / pp.407-415 / 2015
  • Naïve Bayes classification assumes that the input variables are conditionally independent given the output variable. The naïve Bayes assumption is unrealistic, but it simplifies the problem of high-dimensional joint probability estimation into a series of univariate probability estimations. The naïve Bayes classifier is therefore often adopted in the analysis of massive data sets, such as in spam e-mail filtering and recommendation systems. In this paper, we propose a variable selection method based on the $\chi^2$ statistic on input and output variables. The proposed method retains the simplicity of the naïve Bayes classifier in terms of data processing and computation, yet it can select relevant variables. We expect the method to be useful in classification problems for ultra-high-dimensional or big data, such as the classification of diseases based on single nucleotide polymorphisms (SNPs).
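
As a rough illustration of screening categorical variables with a χ² statistic, the sketch below computes Pearson's statistic between each input variable and the output and ranks the inputs by it. The abstract does not give the paper's exact selection rule (threshold, test level, or ordering), so the ranking step and the function names here are assumptions.

```python
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic for the contingency table of xs against ys."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    x_totals = Counter(xs)         # row margins
    y_totals = Counter(ys)         # column margins
    stat = 0.0
    for x, n_x in x_totals.items():
        for y, n_y in y_totals.items():
            expected = n_x * n_y / n          # count expected under independence
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def rank_variables(columns, ys):
    """Sort input variables from most to least associated with the output."""
    scores = {name: chi_square(col, ys) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because each variable is scored on its own against the output, the screening step stays univariate, which matches the computational simplicity the abstract emphasizes.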

A Text Mining-based Intrusion Log Recommendation in Digital Forensics (디지털 포렌식에서 텍스트 마이닝 기반 침입 흔적 로그 추천)

  • Ko, Sujeong
    • KIPS Transactions on Computer and Communication Systems / v.2 no.6 / pp.279-290 / 2013
  • In digital forensics, log files are stored as large data sets for the purpose of tracing users' past behavior. It is difficult for investigators to manually analyze large log data without clues. In this paper, we propose a text mining technique that extracts intrusion logs from a large log set in order to recommend reliable evidence to investigators. In the training stage, the proposed method preprocesses a training log set, extracts intrusion association words from it using the Apriori algorithm, and computes the probability of intrusion for each association word by combining support and confidence. Robinson's method of computing confidences for spam mail filtering is applied to the extraction of intrusion logs. As a result, an association word knowledge base is constructed that includes the intrusion probability weights of the association words, which improves accuracy. In the test stage, the probability of intrusion logs and the probability of normal logs in a test log set are each computed by Fisher's inverse chi-square classification algorithm based on the association word knowledge base, and intrusion logs are extracted by combining the two results. The intrusion logs are then recommended to investigators. The proposed method uses a training method that clearly analyzes the meaning of data in large unstructured logs, which mitigates the loss of accuracy caused by data ambiguity. In addition, because the proposed method recommends intrusion logs using Fisher's inverse chi-square classification algorithm, it reduces the false positive (FP) rate and the laborious effort of extracting evidence manually.
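
The two spam-filtering techniques named in this abstract can be sketched as follows, in the form commonly used in spam filters: Robinson's adjustment of a raw per-word probability toward a prior, and Fisher's inverse chi-square combination of the per-word probabilities. This is an illustration only. The parameter defaults `s` and `x` are conventional assumptions, and the paper's own weighting of association words by support and confidence is not reproduced here.

```python
import math


def robinson(p_word: float, n: int, s: float = 1.0, x: float = 0.5) -> float:
    """Shrink a raw word probability toward the prior x when evidence n is small."""
    return (s * x + n * p_word) / (s + n)


def chi2q(x2: float, v: int) -> float:
    """Upper tail of the chi-square distribution for even degrees of freedom v."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(probs) -> float:
    """Combine per-word probabilities into one score in [0, 1].

    Values near 1 indicate the positive class (spam, or an intrusion log in
    the paper's setting); values near 0 indicate the negative class.
    """
    n = len(probs)
    positive = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    negative = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (positive - negative + 1.0) / 2.0
```

The probabilities passed to `fisher_score` must lie strictly between 0 and 1, which Robinson's adjustment helps ensure for words seen only a few times. A score near 0.5 means the evidence for the two classes roughly cancels out.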