• Title/Abstract/Keywords: Web Spam

Improving the Quality of Web Spam Filtering by Using Seed Refinement (시드 정제 기술을 이용한 웹 스팸 필터링의 품질 향상)

  • Qureshi, Muhammad Atif;Yun, Tae-Seob;Lee, Jeong-Hoon;Whang, Kyu-Young
    • Journal of the Institute of Electronics Engineers of Korea CI / v.48 no.6 / pp.123-139 / 2011
  • Web spam significantly degrades the ranking quality of web search results because it promotes unimportant web pages, so web search engines need to filter it. Web spam filtering identifies spam pages, that is, web pages that contribute to web spam. TrustRank, Anti-TrustRank, Spam Mass, and Link Farm Spam are well-known web spam filtering algorithms in the research literature. The output of these algorithms depends on the input seed, so refining the input seed may improve the quality of web spam filtering. In this paper, we propose seed refinement techniques for these four algorithms. We then apply the techniques to the original algorithms to obtain what we call modified spam filtering algorithms. In addition, we propose a strategy for achieving better filtering quality that considers the possibility that the modified algorithms may support one another if placed in an appropriate succession. In the experiments, we show the effect of seed refinement. We first show that our modified algorithms outperform the respective original algorithms in terms of the quality of web spam filtering. We then show that the best succession significantly outperforms the best-known original and the best modified algorithms, by up to 1.38 times in recall within typical parameter value ranges, while preserving precision.
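
To illustrate why the output of such algorithms depends on the input seed, here is a minimal sketch of TrustRank-style trust propagation, written as a PageRank iteration biased toward a seed set. It is not the seed refinement technique proposed in the paper, which the abstract does not detail. The function name, the damping value, and the toy graph are assumptions made for illustration.

```python
def trust_rank(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along out-links (seed-biased PageRank)."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    # Static trust distribution: uniform over the seed set, zero elsewhere.
    seed_score = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_score)
    for _ in range(iterations):
        # Each page gets a share of the seed distribution...
        nxt = {p: (1.0 - damping) * seed_score[p] for p in pages}
        # ...plus trust split evenly among the out-links of pages pointing to it.
        for p, targets in out_links.items():
            if targets:
                share = damping * trust[p] / len(targets)
                for q in targets:
                    nxt[q] += share
        trust = nxt
    return trust


# Hypothetical toy graph: "good" and "ok" link to each other,
# and "spam" and "farm" link to each other.
graph = {"good": ["ok"], "ok": ["good"], "spam": ["farm"], "farm": ["spam"]}
scores = trust_rank(graph, seeds={"good"})
```

In this toy graph, pages that cannot be reached from the seed receive no trust at all, so choosing a different seed set changes which pages end up with low scores.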

Comparing Feature Selection Methods in Spam Mail Filtering

  • Kim, Jong-Wan;Kang, Sin-Jae
    • Proceedings of the Korea Society of Information Technology Applications Conference / 2005.11a / pp.17-20 / 2005
  • In this work, we compare several feature selection methods for spam mail filtering. The proposed fuzzy inference method outperforms the information gain and chi-squared test methods as a feature selection method in terms of error rate. Because the body of a junk mail carries little text, it provides insufficient hints to distinguish spam from legitimate mail. To address this problem, we follow the hyperlinks contained in the email body, fetch the contents of the remote web pages, and extract hints from both the original email body and the fetched pages. A two-phase approach is applied to filter spam mail: definite hints are used first, and then less definite textual information is used. In our experiment, the proposed two-phase method improved recall by 32.4% on average over using only the first phase or only the second phase.

A Method to Block Spam Mail Automatically Through the Connection to Link URL (링크 유알엘 접속을 통한 스팸메일 자동 차단 방법에 관한 연구)

  • Jung, Nam-Cheol
    • Journal of Digital Contents Society / v.8 no.4 / pp.451-458 / 2007
  • In this paper, I develop a method that automatically blocks spam mail by connecting to the URLs linked in it. The blocking system works as follows. First, the system extracts the URLs linked in an email delivered from any server on the Internet. Next, the system connects to the web pages at those URLs. Finally, the system blocks the email if those web pages contain any keyword defined as a clue to spam mail.
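
The three steps above (extract URLs, visit the linked pages, match keywords) can be sketched roughly as follows. This is an illustrative sketch, not the author's system: the URL pattern, the function names, and the keyword list are assumptions, and page fetching is passed in as a callable so that the example runs offline against a stand-in for the web.

```python
import re

# Simplified URL pattern (an assumption; real mail parsing needs more care).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def should_block(mail_body, fetch_page, spam_keywords):
    """Return True if any page linked from the mail contains a spam keyword."""
    for url in URL_PATTERN.findall(mail_body):
        try:
            page_text = fetch_page(url).lower()
        except OSError:
            continue  # unreachable link: no evidence either way
        if any(keyword.lower() in page_text for keyword in spam_keywords):
            return True
    return False


# Hypothetical stand-in for a real HTTP fetch.
fake_web = {"http://example.com/offer": "Cheap LOANS approved instantly"}


def fetch_from_fake_web(url):
    if url not in fake_web:
        raise OSError("unreachable: " + url)
    return fake_web[url]


blocked = should_block("See http://example.com/offer now",
                       fetch_from_fake_web, ["loans"])
```

Treating an unreachable link as "no evidence" is one possible policy; a stricter filter might block such mail instead.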

Spam Message Filtering with Bayesian Approach for Internet Communities (베이지안을 이용한 인터넷 커뮤니티 상의 유해 메시지 차단 기법)

  • Kim, Bum-Bae;Choi, Hyoung-Kee
    • The KIPS Transactions:PartC / v.13C no.6 s.109 / pp.733-740 / 2006
  • Spam messages have been causing widespread damage on the Internet. One source of the problem is messages posted anonymously on the bulletin boards of Internet communities. Such spam messages try to advertise products, harm others' reputations, deliver religious messages, and so on. In this paper, we present spam message filtering using a Bayesian approach. To make the spam filter more useful for bulletin boards in Internet communities, we built a filter that can divide spam messages into six categories such as advertisement, pornography, abuse, religion, and other. The test was conducted on messages posted on popular web sites.
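
As a rough illustration of how a Bayesian filter can assign a message to one of several categories, the sketch below implements a multinomial naive Bayes classifier with add-one smoothing. The abstract does not specify the paper's model or features, so the class name, the whitespace tokenization, and the training messages (including the "normal" label) are assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Multinomial naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of messages
        self.vocabulary = set()

    def train(self, text, category):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each word in the message.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1)
                                  / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training messages.
spam_filter = NaiveBayesFilter()
spam_filter.train("buy cheap watches now", "advertisement")
spam_filter.train("limited offer buy today", "advertisement")
spam_filter.train("meeting notes for the study group", "normal")
spam_filter.train("thanks for the helpful answer", "normal")
label = spam_filter.classify("buy cheap offer")
```

Adding further categories only requires training messages with the corresponding labels; the classifier picks the category with the highest posterior score.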

Semantics in Social Web: A Case of Personalized Email Marketing (소셜 웹에서의 시맨틱스: 개인화 이메일 마케팅 개발 사례)

  • Joo, Jae-Hun;Myeong, Sung-Jae
    • The Journal of the Korea Contents Association / v.10 no.6 / pp.43-48 / 2010
  • Useful emails influence consumers' purchase behavior and encourage them to visit retail stores. Regular contact with consumers by email has positive effects on brand loyalty. However, email marketing has a limitation: spam now accounts for over half of all email traffic, and the growing number of email users has led to a dramatic increase in spam over the past few years. In this paper, we propose an ontology-based system that offers personalized email services to overcome this limitation. Our method is not ontology-driven spam filtering but a personalized content service that considers personal interests and relations among people by using FOAF and domain ontologies. Our system was successfully tested in the email marketing domain.

A Crowdsourcing-Based Paraphrased Opinion Spam Dataset and Its Implication on Detection Performance (크라우드소싱 기반 문장재구성 방법을 통한 의견 스팸 데이터셋 구축 및 평가)

  • Lee, Seongwoon;Kim, Seongsoon;Park, Donghyeon;Kang, Jaewoo
    • KIISE Transactions on Computing Practices / v.22 no.7 / pp.338-343 / 2016
  • Today, opinion reviews on the Web are often used as a means of information exchange. As the importance of opinion reviews continues to grow, the number of opinion spam issues also increases. Although many studies on detecting spam reviews have been conducted, limitations of the gold-standard datasets hinder research. We therefore introduce a new dataset called "Paraphrased Opinion Spam (POS)" that contains a new type of review spam that imitates truthful reviews. We have noticed that spammers refer to existing truthful reviews to fabricate spam reviews. To create such a seemingly truthful review spam dataset, we asked task participants to paraphrase truthful reviews into new deceptive reviews. The experimental results show that classifying our POS dataset is more difficult than classifying the existing spam datasets, since the reviews in our dataset linguistically resemble truthful reviews more closely. Training volume was also found to be an important factor in classification model performance.

Detecting Spam Data for Securing the Reliability of Text Analysis (텍스트 분석의 신뢰성 확보를 위한 스팸 데이터 식별 방안)

  • Hyun, Yoonjin;Kim, Namgyu
    • The Journal of Korean Institute of Communications and Information Sciences / v.42 no.2 / pp.493-504 / 2017
  • Recently, the tremendous amount of unstructured text data distributed through news, blogs, and social media has gained much attention from researchers and practitioners because it contains abundant information about consumers' opinions. However, as text data becomes more useful, attempts to profit by distorting it, maliciously or non-maliciously, are also increasing. This increase in spam text data not only burdens users who want useful information with a large amount of inappropriate information, but also damages the reliability of information and information providers. Therefore, efforts must be made to improve the reliability of information and the quality of analysis results by detecting and removing spam data in advance. For this purpose, many spam detection studies have been actively conducted in areas such as opinion spam detection, spam e-mail detection, and web spam detection. In this study, we introduce the core concepts and current research trends of spam detection and propose a methodology for detecting spam tags of blogs as one of the challenging attempts to improve the reliability of blog information.

Facebook Spam Post Filtering based on Instagram-based Transfer Learning and Meta Information of Posts (인스타그램 기반의 전이학습과 게시글 메타 정보를 활용한 페이스북 스팸 게시글 판별)

  • Kim, Junhong;Seo, Deokseong;Kim, Haedong;Kang, Pilsung
    • Journal of Korean Institute of Industrial Engineers / v.43 no.3 / pp.192-202 / 2017
  • This study develops a text spam filtering system for Facebook based on two categories of variables: keywords learned from Instagram and meta-information of Facebook posts. Since there are no explicit spam/ham labels for posts, we use Instagram hashtags to train classification models. In addition, filtering accuracy is enhanced by considering meta-information of Facebook posts. To verify the proposed filtering system, we conduct an empirical experiment on a total of 1,795,067 Facebook documents and 761,861 Instagram documents. With random forest as the base classification algorithm, the experimental results show that the proposed filtering system yields 99% filtering accuracy and a 98% F1-measure. We expect that the proposed filtering scheme can be applied to other web services that suffer from massive spam posts but have no explicit spam labels available.

An Automatic Spam e-mail Filter System Using χ2 Statistics and Support Vector Machines (카이 제곱 통계량과 지지벡터기계를 이용한 자동 스팸 메일 분류기)

  • Lee, Songwook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2009.05a / pp.592-595 / 2009
  • We propose an automatic spam mail classifier for e-mail data using Support Vector Machines (SVM). We use the lexical form of each word and its part-of-speech (POS) tag as features. We select useful features with ${\chi}^2$ statistics and represent each feature by its term frequency (TF) and inverse document frequency (IDF) values. After being trained on these features, the SVM classifies each email as spam or not. In our experiment, we achieved 82.7% accuracy on e-mail data collected from a web mail system.
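
For readers unfamiliar with ${\chi}^2$ feature selection, the sketch below computes the standard ${\chi}^2$ statistic for one term from a 2x2 term/class contingency table and keeps the highest-scoring terms. The abstract does not give the paper's exact formulation or thresholds, so the function names and the counts in the example are assumptions.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: spam mails containing the term    n10: non-spam mails containing it
    n01: spam mails without the term       n00: non-spam mails without it
    """
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denominator == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator


def select_features(table_by_term, k):
    """Return the k terms with the highest chi-square scores."""
    ranked = sorted(table_by_term,
                    key=lambda term: chi_square(*table_by_term[term]),
                    reverse=True)
    return ranked[:k]


# Hypothetical counts over 100 mails (50 spam, 50 non-spam).
tables = {"free": (40, 5, 10, 45), "hello": (25, 25, 25, 25)}
top_terms = select_features(tables, 1)
```

A term that appears equally often in both classes scores zero, while a term concentrated in one class scores high and is kept as a feature.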

Characterization of Web Spam through the Korean Web Analysis (국내 웹 분석을 통한 웹 스팸의 특성)

  • Choi, Seung-Jin;Kim, Sung-Kwon
    • Proceedings of the Korean Information Science Society Conference / 2007.10d / pp.333-338 / 2007
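  • Web spam is a technique by which spammers push the pages they want to the top of search results. Pages ranked highly through web spam do not deliver correct information to users. Overseas, the seriousness of web spam has been recognized and research on it is actively under way. In Korea, however, research on web spam is still insufficient. In addition, the web spam detection techniques being studied overseas are difficult to apply to the Korean web. This paper therefore analyzes the characteristics of the Korean web and Korean search sites in various ways and examines how they differ from their overseas counterparts. Based on these differences, it aims to shed light on the kinds of web spam that may appear on the Korean web and to inform future research directions.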
