
Improving the Quality of Web Spam Filtering by Using Seed Refinement  

Qureshi, Muhammad Atif (Dept. of Computer Science, Korea Advanced Institute of Science and Technology)
Yun, Tae-Seob (Dept. of Computer Science, Korea Advanced Institute of Science and Technology)
Lee, Jeong-Hoon (Dept. of Computer Science, Korea Advanced Institute of Science and Technology)
Whang, Kyu-Young (Dept. of Computer Science, Korea Advanced Institute of Science and Technology)
Abstract
Web spam has a significant influence on the ranking quality of web search results because it promotes unimportant web pages. Web search engines therefore need to filter web spam. Web spam filtering is the task of identifying spam pages, that is, web pages that contribute to web spam. TrustRank, Anti-TrustRank, Spam Mass, and Link Farm Spam are well-known web spam filtering algorithms in the research literature. The output of these algorithms depends on the input seed, so refining the input seed may improve the quality of web spam filtering. In this paper, we propose seed refinement techniques for these four well-known spam filtering algorithms. We then apply these techniques to the original algorithms to obtain what we call modified spam filtering algorithms. In addition, we propose a strategy for achieving better web spam filtering quality, in which we consider the possibility that the modified algorithms may support one another if placed in an appropriate succession. In the experiments, we show the effect of seed refinement. To this end, we first show that our modified algorithms outperform the respective original algorithms in terms of the quality of web spam filtering. We then show that the best succession significantly outperforms both the best known original algorithm and the best modified algorithm, by up to 1.38 times in terms of recall while preserving precision, within typical value ranges of parameters.
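To make the seed dependence concrete, the following is a minimal sketch of a seed-biased PageRank in the general style of TrustRank. It is an illustration under stated assumptions and not the authors' implementation or any of their refinement techniques: the function name `trust_scores`, the toy graph, the damping factor of 0.85, and the iteration count are all hypothetical choices made for this example. The sketch also assumes every link target appears as a key in the graph.

```python
def trust_scores(out_links, seeds, damping=0.85, iterations=50):
    """Seed-biased PageRank sketch: trust starts at the seeds and flows along out-links.

    out_links maps each page to the list of pages it links to; every link
    target is assumed to be a key of out_links (an assumption of this sketch).
    """
    pages = sorted(out_links)
    # Teleport vector: uniform over the seed set, zero everywhere else.
    teleport = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    scores = dict(teleport)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * teleport[p] for p in pages}
        for p in pages:
            targets = out_links[p]
            if not targets:
                continue  # dangling page: its trust is simply dropped in this sketch
            share = damping * scores[p] / len(targets)
            for q in targets:
                nxt[q] += share
        scores = nxt
    return scores


# Hypothetical toy graph: two good pages link to each other,
# and two spam pages form a tiny link farm.
graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}

clean = trust_scores(graph, seeds={"good1"})           # seed contains only a good page
noisy = trust_scores(graph, seeds={"good1", "spam1"})  # seed polluted by a spam page
```

With the clean seed, no trust reaches the spam pages because nothing links into the farm. With the polluted seed, trust enters at `spam1` and propagates to `spam2`. This is the kind of sensitivity to the input seed that motivates refining it.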
Keywords
Web spam filtering; input seed refinement; link spam; performance