• Title/Summary/Keyword: web crawling (웹 크롤링)

Effective Web Crawling Orderings from Graph Search Techniques (그래프 탐색 기법을 이용한 효율적인 웹 크롤링 방법들)

  • Kim, Jin-Il;Kwon, Yoo-Jin;Kim, Jin-Wook;Kim, Sung-Ryul;Park, Kun-Soo
    • Journal of KIISE: Computer Systems and Theory / v.37 no.1 / pp.27-34 / 2010
  • Web crawlers are fundamental programs that iteratively download web pages by following links, starting from a small set of initial URLs. Several web crawling orderings have been proposed to crawl popular pages before others, but some graph search techniques whose characteristics and efficient implementations have been studied in the graph theory community have not yet been applied to web crawling orderings. In this paper we consider various graph search techniques, including lexicographic breadth-first search, lexicographic depth-first search, and maximum cardinality search, as well as the well-known breadth-first search and depth-first search, and then choose effective web crawling orderings that have linear time complexity and crawl popular pages early. In particular, for maximum cardinality search and lexicographic breadth-first search, whose implementations are non-trivial, we propose linear-time web crawling orderings by applying the partition refinement method. Experimental results show that maximum cardinality search has desirable properties in both time complexity and the quality of crawled pages.
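
A minimal sketch of a crawl ordering in the spirit of maximum cardinality search may help make the idea concrete. It assumes the natural adaptation to a directed link graph: the next page to crawl is the frontier URL that the most already-crawled pages link to. The function name `mcs_crawl_order`, the toy link graph, and the alphabetical tie-breaking are all illustrative assumptions. The sketch scans the frontier at every step, so it is not linear time and is not the partition refinement implementation the abstract describes.

```python
from collections import defaultdict


def mcs_crawl_order(links, seeds):
    """Return a crawl order that prefers pages linked from many crawled pages.

    links: dict mapping a URL to the URLs it links to (hypothetical toy data).
    seeds: initial URLs, crawled first in the given order.
    """
    score = defaultdict(int)  # frontier URL -> number of crawled pages linking to it
    crawled = []
    seen = set()

    def visit(url):
        crawled.append(url)
        seen.add(url)
        score.pop(url, None)  # the page leaves the frontier once crawled
        for target in set(links.get(url, [])):
            if target not in seen:
                score[target] += 1

    for seed in seeds:
        if seed not in seen:
            visit(seed)

    while score:
        # Highest score first; ties broken alphabetically for determinism (assumption).
        best = min(score, key=lambda u: (-score[u], u))
        visit(best)
    return crawled


toy_links = {"a": ["b", "c", "x"], "b": ["p"], "c": ["p"], "x": [], "p": []}
print(mcs_crawl_order(toy_links, ["a"]))  # ['a', 'b', 'c', 'p', 'x']
```

In this toy graph, plain breadth-first search would crawl `x` before `p`, whereas the sketch crawls `p` first because two crawled pages already link to it.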

A proposal on a proactive crawling approach with analysis of state-of-the-art web crawling algorithms (최신 웹 크롤링 알고리즘 분석 및 선제적인 크롤링 기법 제안)

  • Na, Chul-Won;On, Byung-Won
    • Journal of Internet Computing and Services / v.20 no.3 / pp.43-59 / 2019
  • Today, with the spread of smartphones and the development of social networking services, the amount of stored structured and unstructured big data has grown exponentially. Analyzing these data well yields useful information for predicting the future. To analyze big data, large amounts of data must first be collected, and the web is the repository where most of these data are stored. However, because the volume is so large, data carrying unneeded information are as plentiful as data carrying useful information. This makes it important to collect data efficiently, filtering out data with unnecessary information and collecting only data with useful information. Web crawlers cannot download all pages because of constraints such as network bandwidth, operational time, and data storage. We should therefore avoid visiting pages that are not relevant to what we want and download only important pages as early as possible. This paper seeks to help resolve these issues. First, we introduce basic web crawling algorithms; for each algorithm, we describe its time complexity, pros, and cons, and compare and analyze them. Next, we introduce state-of-the-art web crawling algorithms that address the shortcomings of the basic algorithms. In addition, recent research trends show that web crawling algorithms with special purposes, such as collecting sentiment words, are being actively studied. As one such special-purpose algorithm, we introduce a sentiment-aware web crawling technique, which is a proactive web crawling technique. The results showed that the larger the data, the higher the performance and the more space is saved.

Comparison and Application of Dynamic and Static Crawling for Extracting Product Data from Web Pages (웹페이지에서의 상품 데이터 추출을 위한 동적, 정적 크롤링 비교 및 활용)

  • Sang-Hyuk Kim;Jeong-Hoon Kim;Seung-Dae Lee
    • The Journal of the Korea Institute of Electronic Communication Sciences / v.18 no.6 / pp.1277-1284 / 2023
  • In this paper, we built a web page that gives consumers easy access to products currently on promotion at convenience stores. During development, two crawling methods for extracting promotional product data, static crawling and dynamic crawling, were compared and used. Static crawling collects static data from a web page, and dynamic crawling collects data from pages that are generated dynamically. By comparing the two methods, we studied which is more effective for extracting promotional product data. The web page was then built using static crawling, the more effective of the two; 1+1 and 2+1 products were categorized, and a search function was added.
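
As a rough illustration of static crawling, the sketch below parses HTML that is already present in the server response and groups products into 1+1 and 2+1 categories. It uses only Python's standard `html.parser` module. The markup (`<li class="product" data-promo="...">`), the sample products, and the names `ProductParser` and `group_by_promo` are hypothetical, not taken from the paper. A dynamically generated page would instead need a browser engine to run its scripts before the data could be extracted.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect (promotion, name) pairs from <li class="product" data-promo="..."> items."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._promo = None  # promotion of the <li> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self._promo = attrs.get("data-promo")

    def handle_data(self, data):
        # Only text inside a product <li> is kept; whitespace between tags is ignored.
        if self._promo is not None and data.strip():
            self.products.append((self._promo, data.strip()))

    def handle_endtag(self, tag):
        if tag == "li":
            self._promo = None


def group_by_promo(html):
    """Return a dict mapping each promotion type to the product names found."""
    parser = ProductParser()
    parser.feed(html)
    groups = {}
    for promo, name in parser.products:
        groups.setdefault(promo, []).append(name)
    return groups


# Hypothetical static page content.
SAMPLE_HTML = """
<ul>
  <li class="product" data-promo="1+1">Cola</li>
  <li class="product" data-promo="2+1">Chips</li>
  <li class="product" data-promo="1+1">Gum</li>
</ul>
"""

print(group_by_promo(SAMPLE_HTML))  # {'1+1': ['Cola', 'Gum'], '2+1': ['Chips']}
```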

Designing and implementing web crawling-based SNS web site (웹 크롤링 기반 SNS웹사이트 설계 및 구현)

  • Yoon, Kyung Seob;Kim, Yeon Hong
    • Proceedings of the Korean Society of Computer Information Conference / 2018.01a / pp.21-24 / 2018
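  • On existing Facebook pages, so many user-submitted posts are uploaded that users have difficulty finding the posts they want. To address this, this paper designs and implements a website that crawls the content of various Facebook pages, stores it on a database server so that users can search for the Facebook page content they want, and then serves the crawled content to users.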

Design and Implementation of Event-driven Real-time Web Crawler to Maintain Reliability (신뢰성 유지를 위한 이벤트 기반 실시간 웹크롤러의 설계 및 구현)

  • Ahn, Yong-Hak
    • Journal of the Korea Convergence Society / v.13 no.4 / pp.1-6 / 2022
  • Real-time systems that use web crawling data must provide users with data that is identical to the remote data. To do this, a web crawler repeatedly sends HTTP (HyperText Transfer Protocol) requests to the remote server to check whether the remote data have changed. This process places network load on both the crawling server and the remote server and causes problems such as excessive traffic. To solve this problem, we propose a real-time web crawling technique based on user events that reduces network overload while reliably keeping the data on the crawling server identical to the data at multiple remote locations. The proposed method performs the crawling process in response to events that request unit data and list data. The results show that the proposed method reduces the network traffic overhead of existing web crawlers and secures data reliability. Future research on combining event-based crawling with time-based crawling is required.
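
The contrast between time-based polling and event-based crawling can be sketched as follows. This is only an illustration of the general idea, not the paper's design: the class `EventDrivenCrawler`, its `fetch` callback, and the in-memory `remote` dictionary standing in for a remote server are all assumptions. The crawler contacts the remote source only when a user event asks for an item, so the served copy matches the remote data at request time and no requests are spent on items nobody asks for.

```python
class EventDrivenCrawler:
    """Refresh the local copy of an item only when a user event requests it."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable: key -> current remote data (hypothetical)
        self.store = {}           # local copy kept on the crawling server
        self.remote_requests = 0  # how many times the remote source was contacted

    def on_user_request(self, key):
        # Crawl at event time so the served data matches the remote data.
        self.remote_requests += 1
        self.store[key] = self.fetch(key)
        return self.store[key]


# A dictionary stands in for the remote server in this sketch.
remote = {"item1": "v1", "item2": "v1"}
crawler = EventDrivenCrawler(remote.get)

first = crawler.on_user_request("item1")
remote["item1"] = "v2"                      # the remote data changes
second = crawler.on_user_request("item1")   # the next user event sees the change
print(first, second, crawler.remote_requests)
```

A time-based crawler polling both items every tick would issue two requests per tick regardless of user activity; here the request count grows only with user events.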

Web-Anti-MalWare Malware Detection System (악성코드 탐지 시스템 Web-Anti-Malware)

  • Jung, Seung-il;Kim, Hyun-Woo
    • Proceedings of the Korean Society of Computer Information Conference / 2014.07a / pp.365-367 / 2014
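  • Recently, web services and malware have both been increasing so quickly that their numbers are hard to count. Financial gain has become the main motive behind the malware that grows every year, and public institutions and security companies are actively researching ways to detect it. This paper proposes a malware detection system that combines filtering capable of analyzing packets in real time with web crawling to automatically inspect a domain down to its sub-URLs.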

Web Crawler Service Implementation for Information Retrieval based on Big Data Analysis (빅데이터 분석 기반의 정보 검색을 위한 웹 크롤러 서비스 구현)

  • Kim, Hye-Suk;Han, Na;Lim, Suk-Ja
    • Journal of Digital Contents Society / v.18 no.5 / pp.933-942 / 2017
  • In this paper, we propose a web crawler service for efficiently collecting information about extracurricular activities, competitions, and scholarships for college students and job seekers. The proposed service uses Jsoup tree analysis and JSON-format data transmission to avoid duplicated crawling while crawling at high speed. After collecting relevant information for 24 hours, we confirmed that the web crawler service ran with 100% accuracy. We expect that the service can be applied to various web sites in the future and further improved.
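
One simple way to avoid collecting the same page twice and to hand results on as JSON is sketched below. The paper's service is built on Jsoup in Java; this Python version is only an analogy, and the function names, the record fields (`title`, `url`), and the rule of treating URLs that differ only in their `#fragment` as duplicates are assumptions made for illustration.

```python
import json
from urllib.parse import urldefrag


def collect_unique(items):
    """Drop items whose URL, ignoring any #fragment, was already collected."""
    seen = set()
    unique = []
    for item in items:
        url, _fragment = urldefrag(item["url"])
        if url in seen:
            continue  # already crawled: skip the duplicate
        seen.add(url)
        unique.append({"title": item["title"], "url": url})
    return unique


def to_json(items):
    """Serialize the de-duplicated records as a JSON array."""
    return json.dumps(collect_unique(items), ensure_ascii=False)


# Hypothetical crawled records.
crawled = [
    {"title": "Contest A", "url": "https://example.com/a#top"},
    {"title": "Contest A (again)", "url": "https://example.com/a"},
    {"title": "Scholarship B", "url": "https://example.com/b"},
]
print(to_json(crawled))
```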

Design and Implemention of Real-time web Crawling distributed monitoring system (실시간 웹 크롤링 분산 모니터링 시스템 설계 및 구현)

  • Kim, Yeong-A;Kim, Gea-Hee;Kim, Hyun-Ju;Kim, Chang-Geun
    • Journal of Convergence for Information Technology / v.9 no.1 / pp.45-53 / 2019
  • In this rapidly changing information era, we face problems caused by the excessive information that websites serve. Little of it is useful and much is useless, and we spend a lot of time selecting the information we need. Many websites, including search engines, use web crawling to keep their data up to date. Web crawling is usually used to generate copies of all the pages of visited sites, which search engines then index for faster searching. For collecting wholesale and order information that changes in real time, keyword-oriented web data collection is not adequate, and no alternative for selectively collecting web information in real time has been suggested. In this paper, we propose a method that collects information from restricted web sites using a real-time web crawling distributed monitoring system (R-WCMS), estimates the collection time through detailed analysis of the data, and stores the data in a parallel system. Experimental results show that applying the proposed model to web site information retrieval reduces the time by 15-17%.

Implementation of perfume recommendation service using web crawling and image color extraction artificial intelligence (웹 크롤링과 이미지 색상 추출 인공지능을 이용한 향수 추천 서비스 구현)

  • Yu-jin Kim;Ye-lim Lee;Sung-Yoon Jung;Yu-jin Jo
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.758-759 / 2023
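  • This paper implements a service that recommends perfumes suited to the user by using web crawling and an artificial intelligence color-extraction function. The service is built on Java, which is convenient for building websites, and Python, which is convenient for implementing web crawling and artificial intelligence.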

Information-providing Application Based on Web Crawling (웹 크롤링을 통한 개인 맞춤형 정보제공 애플리케이션)

  • Ju-Hyeon Kim;Jeong-Eun Choi;U-Gyeong Shin;Min-Jun Piao;Tae-Kook Kim
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.295-296 / 2023
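  • This paper studies a personalized information-providing application based on web crawling. The service uses Java's Jsoup library to crawl web data and stores it in MySQL. It then filters the data by user-specified keywords and provides the matching information to the user. For example, when a notice related to a user-specified keyword is updated, the user can check it within the implemented app, and the service also delivers the updated information in real time through KakaoTalk notification messages (AlimTalk).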
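
The keyword-filtering step can be sketched in a few lines. The paper's service uses Java, Jsoup, MySQL, and KakaoTalk; the Python function below only illustrates the filtering logic, and the function name `new_matching_notices`, the notice fields (`id`, `title`), and the in-memory `already_sent` set standing in for stored state are assumptions.

```python
def new_matching_notices(notices, keywords, already_sent):
    """Return notices whose title contains a user keyword and that were not sent before.

    notices: list of dicts with hypothetical fields "id" and "title".
    keywords: keywords chosen by the user.
    already_sent: set of notice ids already delivered; updated in place.
    """
    matches = []
    for notice in notices:
        if notice["id"] in already_sent:
            continue  # the user has already been notified about this one
        title = notice["title"].lower()
        if any(keyword.lower() in title for keyword in keywords):
            matches.append(notice)
            already_sent.add(notice["id"])
    return matches


# Hypothetical crawled notices.
notices = [
    {"id": 1, "title": "Scholarship application open"},
    {"id": 2, "title": "Cafeteria menu for this week"},
]
sent = set()
print(new_matching_notices(notices, ["scholarship"], sent))  # the scholarship notice
print(new_matching_notices(notices, ["scholarship"], sent))  # [] on the second run
```

Tracking delivered ids keeps a repeated crawl of the same notice board from triggering duplicate notifications.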