Effects and Evaluations of URL Normalization  

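Jeong, Hyo-Sook (Department of Computer Science, Soongsil University)
Kim, Sung-Jin (School of Electrical Engineering and Computer Science, Seoul National University)
Lee, Sang-Ho (School of Computing, Soongsil University)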
Abstract
A web page can be represented by syntactically different URLs. URL normalization is the process of transforming URL strings into a canonical form, which can significantly reduce the number of duplicate URL representations of the same web page. A number of normalization methods have been developed and used heuristically, but no study has analyzed them systematically. In this paper, we present a way to evaluate normalization methods in terms of the efficiency and effectiveness of web applications, and give users guidelines for selecting appropriate methods. To this end, we examine all the effects that can occur when a normalization method is applied to web applications, and describe seven metrics for evaluating normalization methods. Finally, we report evaluation results for 12 normalization methods on 25 million actual URLs.
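To make the idea of a canonical form concrete, the following is a minimal sketch of a few widely used syntactic normalization steps (lowercasing the scheme and host, dropping the default port, resolving dot segments, supplying an empty path, and discarding the fragment). The function names `normalize_url` and `remove_dot_segments` and the particular choice of steps are assumptions made for illustration; they are not necessarily among the 12 methods evaluated in the paper.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be omitted without changing the target (illustrative subset).
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments in an absolute path (simplified)."""
    segments = path.split("/")
    output = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            # Never pop the leading empty segment that represents the root.
            if len(output) > 1:
                output.pop()
            continue
        output.append(seg)
    # A path ending in "." or ".." refers to a directory: keep the trailing slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


def normalize_url(url: str) -> str:
    """Hypothetical normalizer applying a handful of syntactic steps.

    Assumption: user info and the fragment are dropped; the query is kept as-is.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = remove_dot_segments(parts.path or "/")
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    for raw in (
        "HTTP://WWW.Example.COM:80/a/./b/../c#top",
        "http://www.example.com",
        "http://www.example.com:8080/x?q=1",
    ):
        print(raw, "->", normalize_url(raw))
```

Under these assumptions, syntactically different strings such as `HTTP://WWW.Example.COM:80/a/./b/../c#top` and `http://www.example.com/a/c` map to the same canonical string, which is the kind of duplicate reduction the abstract refers to. Whether a given step is safe for a particular application is the sort of question the paper's metrics are meant to address.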
Keywords
URL; URL normalization; URL normalization evaluation;