Improving the Quality of Web Spam Filtering by Using Seed Refinement


  • Qureshi, Muhammad Atif (Dept. of Computer Science, Korea Advanced Institute of Science and Technology)
  • Yun, Tae-Seob (Dept. of Computer Science, Korea Advanced Institute of Science and Technology)
  • Lee, Jeong-Hoon (Dept. of Computer Science, Korea Advanced Institute of Science and Technology)
  • Whang, Kyu-Young (Dept. of Computer Science, Korea Advanced Institute of Science and Technology)
  • Received : 2011.07.20
  • Published : 2011.11.25

Abstract

Web spam has a significant influence on the ranking quality of web search results because it promotes unimportant web pages. Therefore, web search engines need to filter web spam. Web spam filtering identifies spam pages, that is, web pages that contribute to web spam. TrustRank, Anti-TrustRank, Spam Mass, and Link Farm Spam are well-known web spam filtering algorithms in the research literature. The output of these algorithms depends on the input seed, so refining the input seed may improve the quality of web spam filtering. In this paper, we propose seed refinement techniques for these four algorithms. We then apply these techniques to the original algorithms to obtain what we call modified spam filtering algorithms. In addition, we propose a strategy to further improve the quality of web spam filtering: we consider the possibility that the modified algorithms may support one another if executed in an appropriate succession. In the experiments, we show the effect of seed refinement. To this end, we first show that our modified algorithms outperform the respective original algorithms in terms of the quality of web spam filtering. Then, we show that the best succession significantly outperforms both the best of the well-known original algorithms and the best modified algorithm, improving recall by up to 1.38 times within typical value ranges of the parameters while preserving precision.
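The dependence on the input seed can be illustrated with a minimal sketch of TrustRank-style trust propagation (a biased PageRank whose teleport mass goes only to the seed pages). This is not the seed refinement technique proposed in the paper, which the abstract does not detail. The toy graph, the page names (`g1`, `s1`, ...), the damping factor, and the flagging threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set, Tuple


def trust_rank(out_links: Dict[str, List[str]], seed: Set[str],
               damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Propagate trust from the seed pages along out-links (biased PageRank)."""
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    # Teleport mass is spread uniformly over the seed pages only.
    teleport = {p: (1.0 / len(seed) if p in seed else 0.0) for p in pages}
    score = dict(teleport)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * teleport[p] for p in pages}
        for src in pages:
            targets = out_links.get(src, [])
            if not targets:
                continue  # dangling page: its trust is dropped in this sketch
            share = damping * score[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        score = nxt
    return score


def flag_low_trust(scores: Dict[str, float], threshold: float = 1e-6) -> Set[str]:
    """Flag pages whose trust score falls below the (assumed) threshold."""
    return {p for p, s in scores.items() if s < threshold}


def precision_recall(flagged: Set[str], true_spam: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the flagged set against labeled spam pages."""
    hits = len(flagged & true_spam)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_spam) if true_spam else 0.0
    return precision, recall


# Hypothetical toy web graph: g* are good pages, s* are spam pages.
graph = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],  # spam may link to good pages, but not vice versa
    "s2": ["s1"],
}
true_spam = {"s1", "s2"}

# A clean seed contains only good pages; a noisy seed wrongly trusts s1.
clean_flagged = flag_low_trust(trust_rank(graph, seed={"g1"}))
noisy_flagged = flag_low_trust(trust_rank(graph, seed={"g1", "s1"}))

clean_pr = precision_recall(clean_flagged, true_spam)
noisy_pr = precision_recall(noisy_flagged, true_spam)
print("clean seed:", sorted(clean_flagged), clean_pr)
print("noisy seed:", sorted(noisy_flagged), noisy_pr)
```

In this toy setting a single mislabeled seed page lets trust leak into the spam cluster, so no spam page is flagged and recall collapses. That sensitivity is the motivation for refining the seed before running the filtering algorithm.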

