• Title/Summary/Keyword: Web page similarity

Measuring Web Page Similarity using Tags (태그를 이용한 웹 페이지간의 유사도 측정 방법)

  • Kang, Sang-Wook; Lee, Ki-Yong; Kim, Hyeon-Gyu; Kim, Myoung-Ho
    • Journal of KIISE:Databases / v.37 no.2 / pp.104-112 / 2010
  • Social bookmarking is one of the most interesting trends in the current web environment. In a social bookmarking system, users annotate a web page with tags, which describe the contents of the page. Numerous studies have been done using this information, mostly on enhancing the quality of web search. In this paper, we use this information to measure the semantic similarity between two web pages. Since web pages consist of various types of multimedia data, it is quite difficult to compare the semantics of two web pages by comparing the actual data contained in the pages. With the help of social bookmarks, this comparison can be performed very effectively. In this paper, we propose a new similarity measure between web pages, called Web Page Similarity Based on Entire Tags (WSET), based on social bookmarks. The experimental results show that the proposed measure yields more satisfactory results than the previous ones.
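
As a rough illustration of tag-based comparison (not the paper's exact WSET formula, which is defined over the entire tag sets attached to two pages), a cosine similarity over tag-frequency vectors could look like this in Python:

```python
from collections import Counter
from math import sqrt

def tag_similarity(tags_a, tags_b):
    """Cosine similarity between two pages' tag-frequency vectors.

    tags_a, tags_b: lists of tags users assigned to each page
    (an illustrative stand-in for the paper's WSET measure).
    """
    va, vb = Counter(tags_a), Counter(tags_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Example: tags collected for two pages in a social bookmarking system
page1 = ["python", "tutorial", "web", "python"]
page2 = ["python", "web", "framework"]
print(tag_similarity(page1, page2))  # ~0.707
```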

An Effective Metric for Measuring the Degree of Web Page Changes (효과적인 웹 문서 변경도 측정 방법)

  • Kwon, Shin-Young; Kim, Sung-Jin; Lee, Sang-Ho
    • Journal of KIISE:Databases / v.34 no.5 / pp.437-447 / 2007
  • A variety of similarity metrics have been used to measure the degree of web page changes. In this paper, we first define criteria for web page changes to evaluate the effectiveness of similarity metrics in terms of six important types of web page changes. Second, we propose a new similarity metric appropriate for measuring the degree of web page changes. Using real web pages and synthesized pages, we analyze five existing metrics (i.e., byte-wise comparison, TF-IDF cosine distance, word distance, edit distance, and shingling) and ours under the proposed criteria. The analysis result shows that our metric represents the changes more effectively than the other metrics. We expect that our study can help users select an appropriate metric for particular web applications.
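
Of the metrics analyzed, shingling is the easiest to sketch. A minimal example measures the degree of change as one minus the Jaccard similarity of the two versions' word-shingle sets (a generic shingling measure, not the paper's proposed metric):

```python
def shingles(text, w=3):
    """Set of w-word shingles (contiguous word windows) from a page's text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def change_degree(old_text, new_text, w=3):
    """Degree of change as 1 - Jaccard similarity of the shingle sets."""
    a, b = shingles(old_text, w), shingles(new_text, w)
    return 1.0 - len(a & b) / len(a | b)

old = "breaking news the market rose sharply today"
new = "breaking news the market fell sharply today"
print(change_degree(old, new))  # 0.75: one changed word breaks 3 shingles
```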

Detecting Intentionally Biased Web Pages In terms of Hypertext Information (하이퍼텍스트 정보 관점에서 의도적으로 왜곡된 웹 페이지의 검출에 관한 연구)

  • Lee, Woo-Key
    • Journal of the Korea Society of Computer and Information / v.10 no.1 s.33 / pp.59-66 / 2005
  • The organization of the Web is increasingly being used to improve search and analysis of information on the Web as a large collection of heterogeneous documents. Most people begin at a Web search engine to find information, but the user's pertinent search results are often greatly diluted by irrelevant data, or sometimes appear on target but still mislead the user in an unwanted direction. One of the intentional, sometimes vicious manipulations of Web databases is the intentionally biased web page, such as Google bombing, which exploits the PageRank algorithm, one of many Web structuring techniques. In this paper, we regard the World Wide Web as a directed labeled graph in which Web pages represent nodes and hyperlinks represent edges. In the present work, we define the label of an edge as a link context together with a similarity measure between the link context and the target page. With this similarity, we can modify the transition matrix of the PageRank algorithm. Through a motivating example, we explain how the proposed algorithm can filter intentionally biased Web pages about 60% more effectively than the conventional PageRank.
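
The key ingredient is a similarity between a link's context and its target page; a bombed link whose anchor text has nothing to do with the target should score low. A minimal bag-of-words sketch (the paper does not specify this exact formula) might be:

```python
from collections import Counter
from math import sqrt

def cosine(text_a, text_b):
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = sqrt(sum(x * x for x in va.values())) * sqrt(sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0

# Edge label: similarity between the link's context and the target page.
# A bombed link pointing at an unrelated page would score near zero.
link_context = "official biography of the president"
target_page = "biography early life career of the president"
print(cosine(link_context, target_page))
```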

Web Page Similarity based on Size and Frequency of Tokens (토큰 크기 및 출현 빈도에 기반한 웹 페이지 유사도)

  • Lee, Eun-Joo; Jung, Woo-Sung
    • Journal of Information Technology Services / v.11 no.4 / pp.263-275 / 2012
  • Web applications are becoming hard to maintain because of the high complexity and duplication of web pages. However, most research on code clones focuses on code hunks, and its targets are limited to specific languages. Thus, we propose GSIM, a language-independent statistical approach to detecting similar pages based on the scarcity and frequency of customized tokens. The tokens, which are obtained by splitting pages with a set of given separators, are defined as the atomic elements for calculating the similarity between two pages. In this paper, we give the domain definition for web applications and the algorithms for collecting tokens, building matrices, and calculating similarity. We also conducted experiments on open source code with our GSIM tool for evaluation. The results show the applicability of the proposed method and the effects of parameters such as threshold, toughness, and token length on quality and performance.
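
A minimal sketch of the token-based comparison, with a hypothetical separator set and a plain Jaccard measure over token multisets standing in for GSIM's frequency-weighted similarity:

```python
import re
from collections import Counter

SEPARATORS = r"[<>\s=\"'/;{}()\[\]]+"  # hypothetical separator set

def tokens(page_source):
    """Tokens obtained by splitting a page's source on the separator set."""
    return [t for t in re.split(SEPARATORS, page_source) if t]

def page_similarity(src_a, src_b):
    """Jaccard similarity over token multisets (an illustrative stand-in
    for GSIM's measure, which also weighs token scarcity)."""
    a, b = Counter(tokens(src_a)), Counter(tokens(src_b))
    inter = sum((a & b).values())   # multiset intersection
    union = sum((a | b).values())   # multiset union
    return inter / union if union else 0.0

p1 = "<html><body><h1>Order List</h1><table id='orders'></table></body></html>"
p2 = "<html><body><h1>Order Detail</h1><table id='orders'></table></body></html>"
print(page_similarity(p1, p2))
```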

PageRank Algorithm Using Link Context (링크내역을 이용한 페이지점수법 알고리즘)

  • Lee, Woo-Key; Shin, Kwang-Sup; Kang, Suk-Ho
    • Journal of KIISE:Databases / v.33 no.7 / pp.708-714 / 2006
  • The World Wide Web has become an entrenched global medium for storing and searching information. Most people begin at a Web search engine to find information, but the user's pertinent search results are often greatly diluted by irrelevant data, or sometimes appear on target but still mislead the user in an unwanted direction. One of the intentional, sometimes vicious manipulations of Web databases is Web spamming, such as Google bombing, which exploits the PageRank algorithm, one of the most famous Web structuring techniques. In this paper, we regard the Web as a directed labeled graph in which Web pages represent nodes and the corresponding hyperlinks represent edges. In the present work, we define the label of an edge as a link context together with a similarity measure between the link context and the target page. With this similarity, we can modify the transition matrix of the PageRank algorithm. A motivating example is investigated in terms of Singular Value Decomposition, with which our algorithm outperforms the conventional PageRank in filtering Web spam pages effectively.
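
A small sketch of how such a similarity could reweight PageRank's transition matrix (the adjacency and similarity values below are invented, and the paper's SVD analysis is not reproduced):

```python
import numpy as np

def context_pagerank(adj, sim, d=0.85, iters=100):
    """PageRank where each link is weighted by its context-target similarity.

    adj[i][j] = 1 if page i links to page j; sim[i][j] in [0, 1] is the
    (assumed precomputed) similarity between the link's context and page j.
    Rows of the weighted matrix are normalized to form the transition matrix.
    """
    n = len(adj)
    w = np.asarray(adj, float) * np.asarray(sim, float)
    row_sums = w.sum(axis=1, keepdims=True)
    # Dangling or fully dissimilar rows fall back to a uniform jump.
    t = np.where(row_sums > 0, w / np.where(row_sums == 0, 1, row_sums), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (r @ t)  # power iteration with damping
    return r

adj = [[0, 1, 1], [1, 0, 1], [1, 0, 0]]
sim = [[0, 0.9, 0.1], [0.8, 0, 0.2], [0.05, 0, 0]]  # low sim = suspect link
print(context_pagerank(adj, sim))
```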

Layout Analysis for Calculation of Web Page Similarity as Image

  • Mitsuhashi, Noriaki; Yamaguchi, Toru; Takama, Yasufumi
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2003.09a / pp.142-145 / 2003
  • When we search for information on the Web using search engines, they analyze only the text information collected from the source files of Web pages. However, there is a limit to analyzing the layout of a Web page from its source file alone, although page design is one of the most important factors a user relies on to evaluate a page. In particular, it often happens on the Web that pages of similar design offer similar information. We propose a method that analyzes layout for comparing the design of pages by treating the displayed page as an image.
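
One simple way to compare pages as images, sketched below under the assumption that screenshots are available as grayscale pixel grids: downscaling blurs the text so that the coarse block layout dominates the comparison (an illustration, not the authors' method):

```python
def downscale(img, factor=2):
    """Average-pool a grayscale image (2D list of 0-255 values) by factor."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)] for y in range(h)]

def layout_similarity(img_a, img_b, factor=2):
    """1 - normalized mean absolute difference of downscaled screenshots."""
    a, b = downscale(img_a, factor), downscale(img_b, factor)
    diff = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 1.0 - diff / (255.0 * len(a) * len(a[0]))

# Two tiny 4x4 grayscale "screenshots" (0 = black, 255 = white)
page_a = [[0, 0, 255, 255]] * 2 + [[255] * 4] * 2
page_b = [[0, 30, 255, 255]] * 2 + [[255] * 4] * 2
print(layout_similarity(page_a, page_b))  # close to 1: similar layouts
```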

Mining Parallel Text from the Web based on Sentence Alignment

  • Li, Bo; Liu, Juan; Zhu, Huili
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.285-292 / 2007
  • The parallel corpus is an important resource in data-driven natural language processing, but only a few parallel corpora are publicly available nowadays, mostly due to the high labor cost of constructing this kind of resource. In this paper, we present a novel strategy for automatically fetching parallel text from the web, which may help solve the lack of high-quality parallel corpora. The system we develop first downloads web pages from certain hosts. Then, candidate parallel page pairs are prepared from the page set based on the outer features of the web pages. In the last step, the candidate page pairs are evaluated: the sentences in each candidate pair are extracted and aligned, and the similarity of the two web pages is then evaluated based on the similarities of the aligned sentences. Experiments on a multilingual web site show the satisfactory performance of the system.
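
A toy sketch of the final scoring step, assuming sentence splitting is already done and using a shared-token ratio as a stand-in for a real translation similarity (the paper's alignment is more sophisticated than this greedy monotone pass):

```python
def page_pair_score(sents_a, sents_b, sentence_sim):
    """Score a candidate parallel page pair as the average similarity of a
    greedy monotone sentence alignment (a simplification of the paper's
    alignment step; sentence_sim is assumed given, e.g. dictionary-based).
    """
    if not sents_a or not sents_b:
        return 0.0
    scores, j = [], 0
    for sa in sents_a:
        # Greedily align each source sentence to its best remaining target.
        best_j = max(range(j, len(sents_b)), key=lambda k: sentence_sim(sa, sents_b[k]))
        scores.append(sentence_sim(sa, sents_b[best_j]))
        j = best_j  # keep the alignment monotone
    return sum(scores) / len(scores)

# Toy similarity: shared-token ratio stands in for translation similarity.
def toy_sim(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

en = ["hello world", "good morning everyone"]
fr = ["hello world", "good morning everyone"]  # pretend translations
print(page_pair_score(en, fr, toy_sim))  # 1.0 for a perfectly parallel pair
```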

An Automatic Web Page Classification System Using Meta-Tag (메타 태그를 이용한 자동 웹페이지 분류 시스템)

  • Kim, Sang-Il; Kim, Hwa-Sung
    • The Journal of Korean Institute of Communications and Information Sciences / v.38B no.4 / pp.291-297 / 2013
  • Recently, the number of web pages, which include various kinds of information, has increased drastically with the explosive growth of WWW usage. This has created a need for web page classification, which makes web pages easier to access and allows them to be searched through grouping. Web page classification means classifying the various web pages scattered across the web according to the similarity of the documents or the keywords contained in them. It can be applied to various areas such as web page search, group search, and e-mail filtering. However, it is impossible to handle the tremendous number of web pages on the web with manual classification. Automatic web page classification also has an accuracy problem, in that it fails to distinguish web pages written in different forms without classification errors. In this paper, we propose an automatic web page classification system using meta-tags that can be obtained from web pages, in order to solve the problem of inaccurate web page retrieval.
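
A minimal sketch of meta-tag-based classification, using Python's standard HTML parser and a hypothetical keyword-overlap category model (the paper's classifier is not specified at this level of detail):

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collect keywords from <meta name="keywords" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "keywords":
            self.keywords += [k.strip().lower() for k in a.get("content", "").split(",")]

# Hypothetical category model: each category is a set of indicative keywords.
CATEGORIES = {"sports": {"football", "baseball", "league"},
              "tech": {"python", "software", "programming"}}

def classify(html):
    parser = MetaTagExtractor()
    parser.feed(html)
    kw = set(parser.keywords)
    # Pick the category whose keyword set overlaps the meta keywords most.
    return max(CATEGORIES, key=lambda c: len(CATEGORIES[c] & kw))

page = '<html><head><meta name="keywords" content="Python, software"></head></html>'
print(classify(page))  # tech
```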

A Structural Complexity Metric for Web Application based on Similarity (유사도 기반의 웹 어플리케이션 구조 복잡도)

  • Jung, Woo-Sung; Lee, Eun-Joo
    • Journal of the Korea Society of Computer and Information / v.15 no.8 / pp.117-126 / 2010
  • Software complexity is used to evaluate a target system's maintainability. The existing complexity metrics for web applications are count-based, so it is hard for them to incorporate the understandability of developers or maintainers. Entropy theory can be applied to define complexity to make up for this shortcoming; however, it assumes that the information quantity of each page is identical. In this paper, the structural complexity of a web application is defined based on information theory and similarity. In detail, the proposed complexity is defined using entropy, as in the previous approach, but the information quantity of individual pages is defined using similarity: a page that is similar to many pages carries a smaller information quantity than a page that is dissimilar to the others. Furthermore, various similarity measures can be used for various views, which yields many-sided complexity measures. Finally, several complexity properties are applied to verify the proposed metric, and case studies show its applicability.
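
An illustrative reading of the idea in Python: give each page an information quantity that shrinks as its average similarity to the other pages grows, so a system of near-duplicates scores lower complexity than a system of distinct pages (this is not the paper's exact definition):

```python
from math import log2

def structural_complexity(sim):
    """Sum of per-page information quantities, where a page that is similar
    to many other pages carries less information (an illustrative reading
    of the paper's idea, not its exact definition).

    sim: symmetric matrix with sim[i][j] in (0, 1], similarity of pages i, j.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        avg_sim = sum(sim[i][j] for j in range(n) if j != i) / (n - 1)
        total += -log2(avg_sim)  # high similarity -> low information
    return total

near_duplicates = [[1, 0.9, 0.9], [0.9, 1, 0.9], [0.9, 0.9, 1]]
distinct_pages = [[1, 0.1, 0.1], [0.1, 1, 0.1], [0.1, 0.1, 1]]
print(structural_complexity(near_duplicates))  # low: duplicated structure
print(structural_complexity(distinct_pages))   # high: many distinct pages
```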

Improving Performance of Search Engine Using Category based Evaluation (범주 기반 평가를 이용한 검색시스템의 성능 향상)

  • Kim, Hyung-Il; Yoon, Hyun-Nim
    • The Journal of the Korea Contents Association / v.13 no.1 / pp.19-29 / 2013
  • In the current Internet environment, where the information space is highly complex, search engines aim to provide the accurate information that users want. However, the content-based method adopted by most search engines is not an effective tool in this environment. Because the content-based method weights each web page using only the morphological characteristics of its vocabulary, it is not effective at distinguishing web pages. To resolve this problem and provide useful information to users, this paper proposes a category-based evaluation method, which extends the query with semantic relations and measures its similarity to web pages. In weighting web pages, the category-based evaluation method utilizes users' responses to retrieved pages and the categories of the query, and thus distinguishes web pages better. The proposed method can effectively provide the information users want through search engines, and the utility of the category-based evaluation technique has been confirmed through various experiments.
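
A toy sketch of category-based query extension, with an invented semantic-relation table: the query is extended with terms related to it under each category, and pages are scored by how densely they match the extended query (a stand-in for the paper's weighting scheme):

```python
# Hypothetical semantic relations used to extend the query per category.
RELATED = {"jaguar": {"animal": ["cat", "wildlife", "habitat"],
                      "car": ["vehicle", "engine", "sedan"]}}

def category_score(query, category, page_text):
    """Similarity of a page to the query extended with category-related
    terms (an illustrative stand-in for the paper's weighting scheme)."""
    terms = [query] + RELATED.get(query, {}).get(category, [])
    words = page_text.lower().split()
    hits = sum(words.count(t) for t in terms)
    return hits / len(words) if words else 0.0

page = "the jaguar is a large cat whose habitat spans wildlife reserves"
for cat in ("animal", "car"):
    print(cat, category_score("jaguar", cat, page))  # "animal" scores higher
```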