• Title/Summary/Keyword: URL normalization

URL Signatures for Improving URL Normalization (URL 정규화 향상을 위한 URL 서명)

  • Soon, Lay-Ki; Lee, Sang-Ho
    • Journal of KIISE:Databases / v.36 no.2 / pp.139-149 / 2009
  • In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose to complement standard URL normalization by incorporating semantically meaningful metadata of the web pages. The metadata taken into consideration are the body text and the page size of each web page, both of which can be extracted during HTML parsing. The results of our first exploratory experiment indicate that body texts are effective in identifying equivalent URLs. Hence, in the second experiment, given a URL that has undergone standard normalization, we construct its URL signature by hashing the body text of the associated web page with the Message-Digest algorithm 5 (MD5). URLs that share identical signatures are considered equivalent in our scheme. The results of the second experiment show that the proposed URL signatures reduced redundant URLs by a further 32.94% in comparison with standard URL normalization.
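
The signature idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `url_signature` and `group_by_signature` are hypothetical, the body text is assumed to be already extracted from the HTML, and UTF-8 encoding before hashing is an assumption the abstract does not specify.

```python
import hashlib


def url_signature(body_text: str) -> str:
    """Return the MD5 hex digest of a page's body text (hypothetical helper)."""
    # Assumption: the body text is encoded as UTF-8 before hashing.
    return hashlib.md5(body_text.encode("utf-8")).hexdigest()


def group_by_signature(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group already-normalized URLs whose body texts hash to the same signature."""
    groups: dict[str, list[str]] = {}
    for url, body_text in pages.items():
        groups.setdefault(url_signature(body_text), []).append(url)
    return groups


if __name__ == "__main__":
    # Illustrative data only; these URLs and texts are made up.
    pages = {
        "http://example.com/index.html": "Welcome to the example site.",
        "http://example.com/": "Welcome to the example site.",
        "http://example.com/about": "About us.",
    }
    for signature, urls in group_by_signature(pages).items():
        print(signature, urls)
```

URLs that land in the same group would be treated as equivalent under this scheme. Any difference in the extracted text, even a single character, yields a different signature.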

URL Normalization for Web Applications (웹 어플리케이션을 위한 URL 정규화)

  • Hong, Seok-Hoo; Kim, Sung-Jin; Lee, Sang-Ho
    • Journal of KIISE:Information Networking / v.32 no.6 / pp.716-722 / 2005
  • On the web, syntactically different URLs can represent the same resource. URL normalization is a process that transforms URLs that are syntactically different but represent the same resource into a canonical form. There are ongoing efforts to define a standard URL normalization. The standard URL normalization is designed to minimize false negatives while strictly avoiding false positives. This paper considers four URL normalization issues beyond those specified in the standard URL normalization. The idea behind our work is to reduce false negatives further while allowing a limited level of false positives. Two metrics are defined to analyze the effect of each step in the URL normalization. We ran an experiment on over 170 million URLs collected from real web pages, and interesting statistical results are reported in this paper.
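
As an illustration of what syntactic normalization into a canonical form looks like, here is a minimal sketch of a few commonly used steps (lowercasing the scheme and host, dropping the default port, removing dot segments, and using "/" for an empty path). The abstract does not list the paper's steps, so this selection and the names `normalize` and `remove_dot_segments` are assumptions. For brevity the sketch also discards any userinfo and does not handle IPv6 literals.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only http and https default ports are considered.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments in a URL path."""
    segments = path.split("/")
    out: list[str] = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg in (".", ".."):
            # ".." removes the previous segment, but never the leading root.
            if seg == ".." and len(out) > 1:
                out.pop()
            # A trailing dot segment leaves a directory, so keep the final "/".
            if is_last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def normalize(url: str) -> str:
    """Apply a small set of syntactic normalization steps to a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = remove_dot_segments(parts.path) or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize("HTTP://Example.COM:80/a/b/../c/./d.html"))
```

Steps of this kind change only the URL string and never fetch the page, which is why they can avoid false positives but still miss equivalent URLs that differ in other ways.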

Effects and Evaluations of URL Normalization (URL정규화의 적용 효과 및 평가)

  • Jeong, Hyo-Sook; Kim, Sung-Jin; Lee, Sang-Ho
    • Journal of KIISE:Databases / v.33 no.5 / pp.486-494 / 2006
  • A web page can be represented by syntactically different URLs. URL normalization is the process of transforming URL strings into a canonical form. Through this process, duplicate URL representations of a web page can be reduced significantly. A number of normalization methods have been developed heuristically and put to use, but no study has analyzed these methods systematically. In this paper, we give a way to evaluate normalization methods in terms of the efficiency and effectiveness of web applications, and give users guidelines for selecting appropriate methods. To this end, we examine all the effects that can take place when a normalization method is applied to web applications, and describe seven metrics for evaluating normalization methods. Lastly, evaluation results for 12 normalization methods on 25 million actual URLs are reported.

Evaluating Site-based URL Normalization (사이트 기반의 URL 정규화 평가)

  • Jeong, Hyo-Sook; Kim, Sung-Jin; Lee, Sang-Ho
    • Proceedings of the Korean Information Science Society Conference / 2005.11b / pp.28-30 / 2005
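  • URL normalization is the process of converting variously expressed equivalent URLs into a single canonical form. Duplicate URL representations of the same document are eliminated through URL normalization. Standard normalization was developed so that no false positives (non-equivalent URLs being transformed into the same string) occur. However, because standard normalization produces many false negatives, extended normalization, which allows some false positives while markedly reducing false negatives, has been proposed and studied. In this paper, we show that applying extended normalization to URLs within the same site yields similar results, and hence that the outcome of an arbitrary extended normalization on URLs in one site can be used effectively when normalizing other URLs in that site. To this end, we propose a measure of the consistency of extended normalization results within a site and a site-based evaluation metric for extended normalization. Six extended normalizations are evaluated on 250,000 URLs extracted from 200 million real Korean web sites.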
