
Splog Detection Using Post Structure Similarity and Daily Posting Count  

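Beak, Jee-Hyun (Department of Computer Engineering, Chung-Ang University)
Cho, Jung-Sik (Department of Computer Engineering, Chung-Ang University)
Kim, Sung-Kwon (Department of Computer Engineering, Chung-Ang University)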
Abstract
A blog is a website, usually maintained by an individual, with regular entries of commentary, descriptions of events, or other material such as graphics or video. Entries are commonly displayed in reverse chronological order. Blog search engines, like web search engines, retrieve information from blogs for searchers. They sometimes return unsatisfactory results, mainly because of spam blogs, or splogs. Splogs are blogs that host spam posts or plagiarized or auto-generated content for the sole purpose of hosting advertisements or raising the search rankings of target sites. This paper focuses on splog detection and proposes a new detection method based on blog post structure similarity and the number of posts per day. Experiments with the proposed method achieve over 90% accuracy on splog detection tasks.
Keywords
Web; Blog; Splog; Web Spam;
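The abstract names two signals, post structure similarity and daily posting count, but does not say how they are computed. The sketch below is one possible reading, not the authors' method: it assumes structure similarity can be approximated by the Jaccard overlap of HTML tag bigrams between posts, and that the posting signal is the largest number of posts on a single day. All function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations


class TagCollector(HTMLParser):
    """Records the sequence of opening tags seen in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_bigrams(html):
    """Return the set of adjacent opening-tag pairs in a post's HTML."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return set(zip(parser.tags, parser.tags[1:]))


def structure_similarity(html_a, html_b):
    """Jaccard similarity of tag bigrams (an assumed measure, not the paper's)."""
    a, b = tag_bigrams(html_a), tag_bigrams(html_b)
    if not a and not b:
        return 1.0  # two posts with no tag pairs are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(posts):
    """Average structure similarity over all pairs of posts in one blog."""
    pairs = list(combinations(posts, 2))
    if not pairs:
        return 0.0  # fewer than two posts: no pair to compare
    return sum(structure_similarity(a, b) for a, b in pairs) / len(pairs)


def max_daily_post_count(post_dates):
    """Largest number of posts published on any single day."""
    return max(Counter(post_dates).values(), default=0)
```

Under these assumptions, a blog whose posts share nearly identical markup and that publishes many posts per day would score high on both features. The two values could then be passed to a classifier such as an SVM, with any thresholds or weights tuned on labeled data.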