
URL Signatures for Improving URL Normalization  

Soon, Lay-Ki (School of Computing, Soongsil University)
Lee, Sang-Ho (School of Computing, Soongsil University)
Abstract
In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose to complement standard URL normalization by incorporating semantically meaningful metadata of the web pages. The metadata considered are the body text and the page size of each web page, both of which can be extracted during HTML parsing. The results of our first, exploratory experiment indicate that the body text is effective in identifying equivalent URLs. Hence, in the second experiment, given a URL that has undergone standard normalization, we construct its URL signature by hashing the body text of the associated web page with the Message-Digest algorithm 5 (MD5). URLs that share identical signatures are considered equivalent in our scheme. The results of the second experiment show that the proposed URL signatures reduced redundant URLs by a further 32.94% compared with standard URL normalization alone.
Keywords
URL normalization; URL signatures; web page crawling
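
The signature scheme summarized in the abstract (hash the body text of each page with MD5, then treat URLs with identical hashes as equivalent) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `BodyTextExtractor`, `body_text`, `url_signature`, and `group_equivalent` are hypothetical, as are the decisions to skip `<script>` and `<style>` content and to collapse whitespace before hashing. The sketch also assumes the input URLs have already been through standard normalization.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text inside <body>, skipping <script> and <style> (assumption)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text(html):
    """Return the body text with runs of whitespace collapsed (assumption)."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def url_signature(html):
    """MD5 digest of the page's body text, as a hex string."""
    return hashlib.md5(body_text(html).encode("utf-8")).hexdigest()


def group_equivalent(pages):
    """Group URLs whose pages share a signature.

    `pages` maps an already-normalized URL to the HTML fetched from it.
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[url_signature(html)].append(url)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical pages: the first two differ only in <title> and spacing.
    pages = {
        "http://example.com/": "<html><head><title>A</title></head>"
                               "<body><p>Hello  world</p></body></html>",
        "http://example.com/index.html": "<html><head><title>B</title></head>"
                                         "<body><p>Hello world</p></body></html>",
        "http://example.com/about": "<html><body><p>About us</p></body></html>",
    }
    for group in group_equivalent(pages):
        print(group)
```

In this sketch the first two URLs fall into one group because only their body text is hashed, so differences in the `<head>` do not affect the signature. An exact hash only matches pages whose extracted text is identical, so any dynamic content in the body would produce distinct signatures.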