URL Signatures for Improving URL Normalization

  • Published : 2009.04.15

Abstract

In the standard URL normalization mechanism, URLs are normalized syntactically through a set of predefined steps. In this paper, we propose to complement standard URL normalization by incorporating semantically meaningful metadata of the web pages. The metadata considered are the body text and the page size of each web page, both of which can be extracted during HTML parsing. The results of our first exploratory experiment indicate that the body text is effective in identifying equivalent URLs. Hence, in the second experiment, given a URL that has undergone standard normalization, we construct its URL signature by hashing the body text of the associated web page with the Message-Digest algorithm 5 (MD5). URLs that share an identical signature are considered equivalent in our scheme. The results of the second experiment show that the proposed URL signatures further reduced redundant URLs by 32.94% compared with standard URL normalization alone.
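The signature scheme described in the abstract can be illustrated with a minimal Python sketch. It assumes the body text of each page has already been extracted during HTML parsing and that each URL has already passed standard normalization. The function names `url_signature` and `group_equivalent_urls`, the UTF-8 encoding choice, and the sample URLs are illustrative assumptions, not details taken from the paper.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def url_signature(body_text: str) -> str:
    """Return the MD5 hex digest of a page's body text.

    Assumption: the text is encoded as UTF-8 before hashing.
    """
    return hashlib.md5(body_text.encode("utf-8")).hexdigest()


def group_equivalent_urls(pages: Dict[str, str]) -> List[List[str]]:
    """Group URLs whose body texts hash to the same signature.

    `pages` maps an already-normalized URL to its extracted body text.
    """
    groups = defaultdict(list)
    for url, body_text in pages.items():
        groups[url_signature(body_text)].append(url)
    # Sort for a deterministic result regardless of input order.
    return sorted(sorted(urls) for urls in groups.values())


# Hypothetical sample data: two URLs serve the same body text.
sample_pages = {
    "http://example.com/": "Welcome to the example site.",
    "http://example.com/index.html": "Welcome to the example site.",
    "http://example.com/about": "About us.",
}

equivalent_groups = group_equivalent_urls(sample_pages)
```

In this sketch the first two sample URLs end up in the same group because their body texts are byte-for-byte identical. Any difference in the text, even in whitespace, would produce a different signature, so a real crawler would need to decide how the body text is cleaned before hashing.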
