• Title/Summary/Keyword: parallel web crawler (병렬 웹 크롤러)


Implementation of a Parallel Web Crawler for the Odysseus Large-Scale Search Engine (오디세우스 대용량 검색 엔진을 위한 병렬 웹 크롤러의 구현)

  • Shin, Eun-Jeong; Kim, Yi-Reun; Heo, Jun-Seok; Whang, Kyu-Young
    • Journal of KIISE: Computing Practices and Letters / v.14 no.6 / pp.567-581 / 2008
  • As the size of the web grows explosively, search engines are becoming increasingly important as the primary means of retrieving information from the Internet. A search engine periodically downloads web pages and stores them in its database to provide users with up-to-date search results. The web crawler is the program that downloads and stores web pages for this purpose. A large-scale search engine uses a parallel web crawler to collect web pages while maximizing the download rate. However, the architecture and experimental analysis of parallel web crawlers have not been fully discussed in the literature. In this paper, we propose an architecture for a parallel web crawler and discuss implementation issues in detail. The proposed parallel web crawler is based on the coordinator/agent model, using multiple machines to download web pages in parallel: multiple agent machines collect web pages, and a single coordinator machine manages them. The parallel web crawler consists of three components: a crawling module for collecting web pages, a converting module for transforming the web pages into a database-friendly format, and a ranking module for rating web pages based on their relative importance. We explain each component of the parallel web crawler and its implementation in detail. Finally, we conduct extensive experiments to analyze the effectiveness of the parallel web crawler. The experimental results demonstrate the merit of our architecture: the proposed parallel web crawler scales with the number of web pages to crawl and the number of machines used.
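
The coordinator/agent split described in this abstract lends itself to a small illustration. The following is a minimal sketch, not the paper's implementation: a single coordinator process partitions URLs by host across several agent processes, which download pages in parallel. All names (coordinator, crawl_agent, the host-hash partitioning) are hypothetical.

```python
# Hypothetical sketch of the coordinator/agent model: one coordinator process
# partitions seed URLs among agent processes, and each agent downloads its share.
from multiprocessing import Process, Queue
from urllib.parse import urlparse
from urllib.request import urlopen

def crawl_agent(agent_id: int, task_queue: Queue, result_queue: Queue) -> None:
    """Agent machine stand-in: download pages handed out by the coordinator."""
    while True:
        url = task_queue.get()
        if url is None:          # sentinel: coordinator has no more work
            break
        try:
            html = urlopen(url, timeout=10).read()
            # "converting module" stand-in: reduce the page to a storable record
            result_queue.put((agent_id, url, len(html)))
        except OSError as exc:
            result_queue.put((agent_id, url, f"error: {exc}"))

def coordinator(seed_urls: list[str], num_agents: int = 4) -> list[tuple]:
    """Coordinator machine stand-in: assign URLs to agents by host hash."""
    task_queues = [Queue() for _ in range(num_agents)]
    result_queue: Queue = Queue()
    agents = [Process(target=crawl_agent, args=(i, task_queues[i], result_queue))
              for i in range(num_agents)]
    for p in agents:
        p.start()
    # Partition the URL space so one host is always crawled by the same agent.
    for url in seed_urls:
        task_queues[hash(urlparse(url).netloc) % num_agents].put(url)
    for q in task_queues:
        q.put(None)              # tell each agent it is done
    results = [result_queue.get() for _ in seed_urls]
    for p in agents:
        p.join()
    return results

if __name__ == "__main__":
    print(coordinator(["https://example.com/", "https://example.org/"]))
```

Partitioning by host also keeps per-site politeness simple, since requests to one host never come from two agents at once.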

Distributed Parallel Crawler Design and Implementation (분산형 병렬 크롤러 설계 및 구현)

  • Jang, Hyun Ho; Jeon, Kyung-sik; Lee, HooKi
    • Convergence Security Journal / v.19 no.3 / pp.21-28 / 2019
  • As the number of websites managed by an organization grows, so does the number of web application servers and containers. Checking the status of the web services running on these servers and containers is very difficult when a person must access each physical server at a remote site through a terminal or other access software. Previous crawler research rarely addresses how the data obtained from crawling is processed, and data loss can occur when the crawler writes its results directly to the database. In this paper, we propose a crawl-based method for web application server management that stores the inspection data without losing it.
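
To make the data-loss point concrete, here is a minimal sketch, not the paper's design, of one common way to keep inspection records from being lost: crawler code only enqueues results, and a single writer thread drains the queue into the database, retrying failed writes. The names (InspectionWriter, the SQLite schema, the retry count) are hypothetical.

```python
# Hypothetical sketch: buffer crawl/inspection results in a queue and let a
# dedicated writer drain it into SQLite, retrying instead of dropping records.
import queue
import sqlite3
import threading
import time

class InspectionWriter(threading.Thread):
    """Single writer thread: crawlers never touch the database directly."""

    def __init__(self, db_path: str = "inspections.db") -> None:
        super().__init__(daemon=True)
        self.buffer = queue.Queue()
        self.db_path = db_path

    def run(self) -> None:
        conn = sqlite3.connect(self.db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS inspection "
                     "(url TEXT, status INTEGER, checked_at REAL)")
        while True:
            record = self.buffer.get()
            if record is None:                       # shutdown sentinel
                break
            for _ in range(3):                       # retry instead of losing data
                try:
                    conn.execute("INSERT INTO inspection VALUES (?, ?, ?)", record)
                    conn.commit()
                    break
                except sqlite3.OperationalError:
                    time.sleep(0.5)
        conn.close()

writer = InspectionWriter()
writer.start()
writer.buffer.put(("https://example.com/health", 200, time.time()))  # from a crawler
writer.buffer.put(None)
writer.join()
```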

Multi-threaded Web Crawling Design using Queues (큐를 이용한 다중스레드 방식의 웹 크롤링 설계)

  • Kim, Hyo-Jong; Lee, Jun-Yun; Shin, Seung-Soo
    • Journal of Convergence for Information Technology / v.7 no.2 / pp.43-51 / 2017
  • Background/Objectives: The purpose of this study is to design and implement a multi-threaded web crawler using queues that addresses the time delay of single-process crawling, the cost of parallel-processing approaches, and the wasted manpower of operating multiple bots connected over a wide area network. Methods/Statistical analysis: The study designs and analyzes applications that run on independent systems configured as multi-threaded clients with queues. Findings: We propose a multi-threaded web crawler design using queues. The throughput of web documents can be analyzed per client and per thread according to the given formula, and the optimal number of clients can be determined by checking the efficiency of each thread. The proposed system is based on distributed processing: clients in independent environments deliver web documents quickly and reliably using queues and threads. Application/Improvements: Rather than a crawler designed for a particular site, a general-purpose web crawler that applies queues and multiple threads is needed to navigate and collect diverse web sites quickly and efficiently.
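
As a rough illustration of the queue-and-thread design the abstract describes (not the authors' code; the names and the simple pages-per-second figure are illustrative), one client could look like this:

```python
# Hypothetical sketch of a queue-based multi-threaded fetcher: one shared URL
# queue, several worker threads, and a per-thread page counter from which a
# simple throughput figure can be read off.
import queue
import threading
import time
from urllib.request import urlopen

def worker(url_queue: queue.Queue, counts: dict, lock: threading.Lock) -> None:
    name = threading.current_thread().name
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return                                  # queue drained, thread exits
        try:
            urlopen(url, timeout=10).read()
            with lock:                              # count pages per thread
                counts[name] = counts.get(name, 0) + 1
        except OSError:
            pass
        finally:
            url_queue.task_done()

def crawl(urls: list[str], num_threads: int = 8) -> dict:
    url_queue: queue.Queue = queue.Queue()
    for u in urls:
        url_queue.put(u)
    counts: dict = {}
    lock = threading.Lock()
    start = time.monotonic()
    threads = [threading.Thread(target=worker, args=(url_queue, counts, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.monotonic() - start
    counts["pages_per_second"] = sum(counts.values()) / elapsed if elapsed else 0.0
    return counts

if __name__ == "__main__":
    print(crawl(["https://example.com/", "https://example.org/"]))
```

Comparing the per-thread counts against the total gives a crude view of whether adding threads (or clients) still pays off, which is the kind of efficiency check the abstract refers to.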

Design and Implementation of a High Performance Web Crawler (고성능 웹크롤러의 설계 및 구현)

  • 권성호; 이영탁; 김영준; 이용두
    • Journal of Korea Society of Industrial Information Systems / v.8 no.4 / pp.64-72 / 2003
  • A web crawler is an important Internet software technology used in a variety of Internet applications, including search engines. As the Internet continues to grow, high-performance web crawler implementations are urgently demanded. In this paper, we study how to support dynamic scheduling for a multiprocess-based web crawler. For high performance, web crawlers are usually implemented with multiple processes; in such systems, crawl scheduling, which manages the allocation of web pages to each process for downloading, is one of the important issues. We identify the issues that are important and challenging in crawl scheduling. To address them, we propose a dynamic crawl scheduling framework and, subsequently, a system architecture for a web crawler with dynamic crawl scheduling support. We also analyzed the behavior of the web crawler and, based on the analysis results, suggest directions for the design of a high-performance web crawler.
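
A minimal sketch of the dynamic-scheduling idea, assuming a pull-based work pool rather than the paper's actual framework: idle worker processes take the next URL as they become free, instead of receiving a fixed partition up front, so a slow page on one worker does not stall the rest.

```python
# Hypothetical sketch of dynamic (pull-based) crawl scheduling across worker
# processes. Names, URLs, and the batch size are illustrative only.
from multiprocessing import Pool
from urllib.request import urlopen

def fetch(url: str) -> tuple[str, int]:
    """Download one page; return its URL and size (or -1 on failure)."""
    try:
        return url, len(urlopen(url, timeout=10).read())
    except OSError:
        return url, -1

if __name__ == "__main__":
    urls = ["https://example.com/", "https://example.org/", "https://example.net/"]
    with Pool(processes=4) as pool:
        # chunksize=1 makes the pool hand out URLs one at a time as workers
        # become free, i.e. a simple form of dynamic scheduling.
        for url, size in pool.imap_unordered(fetch, urls, chunksize=1):
            print(url, size)
```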


Modern Concurrent Programming for Multicore Environments (동시성으로 작성하는 파이썬 크롤러)

  • Kim, Nam-gue; Kang, Young-Jin; Lee, HoonJae
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2017.05a / pp.430-433 / 2017
  • Concurrent programming is essential for developers: without it, program speed is unlikely to improve unless the hardware itself advances. Programming languages with good support for concurrent code include Go, Elixir, and Scala. Python, which offers many useful libraries, also supports concurrent programming through asyncio and coroutines. This paper defines the concepts of concurrency and parallelism and explains what to watch for when writing concurrent code in Python. A crawler that collects web data is written with concurrent code and compared with sequential and multi-threaded versions.
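
A minimal coroutine-based version of such a crawler might look like the sketch below. aiohttp is an assumed choice of HTTP client (the paper does not name one), and the URLs are placeholders.

```python
# Hypothetical asyncio/coroutine crawler sketch: one thread, one event loop,
# many downloads in flight at once while each coroutine waits on the network.
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> tuple[str, int]:
    """Fetch one page inside a coroutine; many of these run concurrently."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            body = await resp.read()
            return url, len(body)
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, -1

async def crawl(urls: list[str]) -> list[tuple[str, int]]:
    # The event loop switches between coroutines at each await, so downloads
    # overlap without threads or processes.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    print(asyncio.run(crawl(["https://example.com/", "https://example.org/"])))
```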


Design and Implementation of a High Performance Web Crawler (고성능 웹크롤러의 설계 및 구현)

  • Kim, Hie-Cheol; Chae, Soo-Hoan
    • Journal of Digital Contents Society / v.4 no.2 / pp.127-137 / 2003
  • A web crawler is an important Internet software technology used in a variety of Internet applications, including search engines. As the Internet continues to grow, high-performance web crawler implementations are urgently demanded. In this paper, we study how to support dynamic scheduling for a multiprocess-based web crawler. For high performance, web crawlers are usually implemented with multiple processes; in such systems, crawl scheduling, which manages the allocation of web pages to each process for downloading, is one of the important issues. We identify the issues that are important and challenging in crawl scheduling. To address them, we propose a dynamic crawl scheduling framework and, subsequently, a system architecture for a web crawler with dynamic crawl scheduling support. This paper presents the design of the web crawler with dynamic scheduling support.
