• Title/Summary/Keyword: Page-Rank

Search Result 102, Processing Time 0.02 seconds

Korean Web Content Extraction using Tag Rank Position and Gradient Boosting (태그 서열 위치와 경사 부스팅을 활용한 한국어 웹 본문 추출)

  • Mo, Jonghoon;Yu, Jae-Myung
    • Journal of KIISE
    • /
    • v.44 no.6
    • /
    • pp.581-586
    • /
    • 2017
  • For automatic web scraping, unnecessary components such as menus and advertisements need to be removed from web pages and main contents should be extracted automatically. A content block tends to be located in the middle of a web page. In particular, Korean web documents rarely include metadata and have a complex design; a suitable method of content extraction is therefore needed. Existing content extraction algorithms use the textual and structural features of content blocks because processing visual features requires heavy computation for rendering and image processing. In this paper, we propose a new content extraction method using the tag positions in HTML as a quasi-visual feature. In addition, we develop a tag rank position, a type of tag position not affected by text length, and show that gradient boosting with the tag rank position is a very accurate content extraction method. The result of this paper shows that the content extraction method can be used to collect high-quality text data automatically from various web pages.

Nonparametric procedures based on aligned method and placement for ordered alternatives in randomized block design (랜덤화 블록 모형에서 정렬방법과 위치를 이용한 순서형 대립가설에 대한 비모수 검정법)

  • Kim, Hyosook;Kim, Dongjae
    • The Korean Journal of Applied Statistics
    • /
    • v.29 no.4
    • /
    • pp.707-717
    • /
    • 2016
  • Nonparametric procedures in a randomized block design was proposed by Friedman (1937) as a general alternative as well as suggested as a test for ordered alternatives by Page (1963). These methods are used for the rank of treatments in each block. In this paper, we proposed nonparametric procedures using aligned method proposed by Hodges and Lehmann (1962) to reduce among block information and based on placement suggested by Kim (1999) in a randomized block design. We also perform a Monte Carlo study to compare the empirical powers of the proposed procedures and established method.

A Query Randomizing Technique for breaking 'Filter Bubble'

  • Joo, Sangdon;Seo, Sukyung;Yoon, Youngmi
    • Journal of the Korea Society of Computer and Information
    • /
    • v.22 no.12
    • /
    • pp.117-123
    • /
    • 2017
  • The personalized search algorithm is a search system that analyzes the user's IP, cookies, log data, and search history to recommend the desired information. As a result, users are isolated in the information frame recommended by the algorithm. This is called 'Filter bubble' phenomenon. Most of the personalized data can be deleted or changed by the user, but data stored in the service provider's server is difficult to access. This study suggests a way to neutralize personalization by keeping on sending random query words. This is to confuse the data accumulated in the server while performing search activities with words that are not related to the user. We have analyzed the rank change of the URL while conducting the search activity with 500 random query words once using the personalized account as the experimental group. To prove the effect, we set up a new account and set it as a control. We then searched the same set of queries with these two accounts, stored the URL data, and scored the rank variation. The URLs ranked on the upper page are weighted more than the lower-ranked URLs. At the beginning of the experiment, the difference between the scores of the two accounts was insignificant. As experiments continue, the number of random query words accumulated in the server increases and results show meaningful difference.

The Effective Blog Search Algorithm based on the Structural Features in the Blogspace (블로그의 구조적 특성을 고려한 효율적인 블로그 검색 알고리즘)

  • Kim, Jung-Hoon;Yoon, Tae-Bok;Lee, Jee-Hyong
    • Journal of KIISE:Software and Applications
    • /
    • v.36 no.7
    • /
    • pp.580-589
    • /
    • 2009
  • Today, most web pages are being created in the blogspace or evolving into the blogspace. A blog entry (blog page) includes non-traditional features of Web pages, such as trackback links, bloggers' authority, tags, and comments. Thus, the traditional rank algorithms are not proper to evaluate blog entries because those algorithms do not consider the blog specific features. In this paper, a new algorithm called "Blog-Rank" is proposed. This algorithm ranks blog entries by calculating bloggers' reputation scores, trackback scores, and comment scores based on the features of the blog entries. This algorithm is also applied to searching for information related to the users' queries in the blogspace. The experiment shows that it finds the much more relevant information than the traditional ranking algorithms.

A PageRank based Data Indexing Method for Designing Natural Language Interface to CRM Databases (분석 CRM 실무자의 자연어 질의 처리를 위한 기업 데이터베이스 구성요소 인덱싱 방법론)

  • Park, Sung-Hyuk;Hwang, Kyeong-Seo;Lee, Dong-Won
    • CRM연구
    • /
    • v.2 no.2
    • /
    • pp.53-70
    • /
    • 2009
  • Understanding consumer behavior based on the analysis of the customer data is one essential part of analytic CRM. To do this, the analytic skills for data extraction and data processing are required to users. As a user has various kinds of questions for the consumer data analysis, the user should use database language such as SQL. However, for the firm's user, to generate SQL statements is not easy because the accuracy of the query result is hugely influenced by the knowledge of work-site operation and the firm's database. This paper proposes a natural language based database search framework finding relevant database elements. Specifically, we describe how our TableRank method can understand the user's natural query language and provide proper relations and attributes of data records to the user. Through several experiments, it is supported that the TableRank provides accurate database elements related to the user's natural query. We also show that the close distance among relations in the database represents the high data connectivity which guarantees matching with a search query from a user.

  • PDF

Performance Analysis on Hadoop with SSD for Interative Process (SSD 타입 저장장치를 포함하는 Hadoop 시스템의 Iterative Processing 처리 성능 분석)

  • Oh, Sangyoon;Kwon, Seong-Min;Lee, Sookyung
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2016.07a
    • /
    • pp.191-193
    • /
    • 2016
  • 본 논문에서는 SSD 저장장치를 포함하는 하둡의 Iterative Processing에 대한 성능 분석 결과를 소개한다. 하둡은 맵 리듀스 병렬 프로그래밍 모델을 통해 Batch Processing에 특화된 구조를 가지고 있는 프레임 워크이다. 이는 병렬/분산 환경에서 큰 성능향상을 보장하지만, 반복 작업을 수행하는 Iterative Processing에 대하여는 성능이 낮아지는 문제가 존재하고 있다. 이에 본 논문에서는 점차 낮아지는 가격으로 인해 하둡시스템에 적용 가능성이 타진되는 SSD를 통해 반복 작업의 성능이슈를 해결할 수 있는지 확인하고, SSD를 통한 성능향상의 요소가 존재하는지 알아보고자 실험을 진행하였다. 실험에서는 Batch Processing인 word count와 Iterative Processing인 Page Rank 알고리즘을 MapReduce로 구현하고 데이터 크기에 따른 성능 향상도를 측정하였고, SSD 추가와 같은 하드웨어적인 성능을 통한 하둡의 반복 작업은 큰 효율을 기대하기가 어렵다는 결론을 보였다.

  • PDF

Efficient Internet Information Extraction Using Hyperlink Structure and Fitness of Hypertext Document (웹의 연결구조와 웹문서의 적합도를 이용한 효율적인 인터넷 정보추출)

  • Hwang Insoo
    • Journal of Information Technology Applications and Management
    • /
    • v.11 no.4
    • /
    • pp.49-60
    • /
    • 2004
  • While the World-Wide Web offers an incredibly rich base of information, organized as a hypertext it does not provide a uniform and efficient way to retrieve specific information. Therefore, it is needed to develop an efficient web crawler for gathering useful information in acceptable amount of time. In this paper, we studied the order in which the web crawler visit URLs to rapidly obtain more important web pages. We also developed an internet agent for efficient web crawling using hyperlink structure and fitness of hypertext documents. As a result of experiment on a website. it is shown that proposed agent outperforms other web crawlers using BackLink and PageRank algorithm.

  • PDF

Performance Analysis of Web-Crawler in Multi-thread Environment (다중 쓰레드 환경에서 웹 크롤러의 성능 분석)

  • Park, Jung-Woo;Kim, Jun-Ho;Lee, Won-Joo;Jeon, Chang-Ho
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2009.01a
    • /
    • pp.473-476
    • /
    • 2009
  • 본 논문에서는 다중 쓰레드 환경에서 동작하는 웹 크롤러를 구현하고 성능을 분석한다. 이 웹 크롤러의 특징은 검색시간을 단축하기 위하여 크롤링, 파싱 및 페이지랭킹, DB 저장 모듈을 서로 독립적으로 다른 작업을 수행하도록 구현한 것이다. 크롤링 모듈은 웹상의 데이터를 수집하는 기능을 제공한다. 그리고 파싱 및 페이지랭크 모듈은 수집한 데이터를 파싱하고, 웹 페이지의 상대적인 중요도를 수치로 계산하여 페이지랭크를 지정한다. DB 연동 모듈은 페이지랭크 모듈에서 구한 페이지랭크를 데이터베이스에 저장한다. 성능평가에서는 다중 쓰레드 환경에서 쓰레드 수와 웹 페이지의 수에 따른 검색 시간을 측정하여 그 결과를 비교 평가한다.

  • PDF

SNA to assess the Influence of Organization Members (Focusing on core members of North Korea)

  • Lee, Young-Seok;Yoon, Soungwoong;Lee, Sang-Hoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.23 no.7
    • /
    • pp.73-80
    • /
    • 2018
  • There are various organizations in modern society, in which people have direct and indirect relationships. Internal structure of these organizations can be analyzed by the relationships which are officially pressed on the media. However, this task will be difficult when the media information is strictly limited, though the necessity of analyzing organization structure remains. In this study, we try to estimate the influence of North Korea's core members by using PageRank centrality to supplement the limitation of previous SNA analysis methods. Experimental results show that we can show and predict NK's power shifts more efficiently.

Link ranking-based hierarchical structuring of web site (링크 중요도에 기반한 웹사이트의 계층 구조화)

  • Lim, Tae-Soo;Park, Bum-Hwan;Lee, Woo-Key
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2005.11b
    • /
    • pp.745-747
    • /
    • 2005
  • 수많은 웹페이지들이 하이퍼링크를 통해 복잡하게 연결된 그래프 구조를 가지고 있는 웹사이트를 계층적으로 구조화하는 것은 해당 사이트를 검색하고자 할 때, 정보를 재조직화하고 고려해야 할 대안들의 개수를 감소시킨다는 점에서 매우 유용하다. 본 논문은 웹사이트의 의미론적인 계층화를 최적화하기 위하여 사용자의 순회 경로, 즉 웹아크의 중요도 합을 최대화할 수 있는 트리 구조를 생성하였다. 구체적으로 첫째 PageRank에 기반한 웹아크 중요도를 생성하였고, 둘째 Minimum-Cost Arborescence 문제를 이용하여 최적 트리 구조를 생성하였다. 사용자의 질의에 독립적으로 생성된 트리 구조는 웹사이트의 의미 있는 계층 구조로서 사용자로 하여금 해당 사이트를 보다 효과적으로 검색할 수 있도록 도와줄 것이다.

  • PDF