• Title/Summary/Keyword: Crawler

Search results: 202 (processing time: 0.029 seconds)

Design and Implementation of Event-driven Real-time Web Crawler to Maintain Reliability (신뢰성 유지를 위한 이벤트 기반 실시간 웹크롤러의 설계 및 구현)

  • Ahn, Yong-Hak
    • Journal of the Korea Convergence Society
    • /
    • v.13 no.4
    • /
    • pp.1-6
    • /
    • 2022
  • Real-time systems that use web crawling data must provide users with data identical to the data held remotely. To do this, the web crawler repeatedly sends HTTP (HyperText Transfer Protocol) requests to the remote server to check whether the remote data has changed. This process places network load on both the crawling server and the remote server, causing problems such as excessive traffic. To solve this problem, this paper proposes an event-driven real-time web crawling technique that reduces network overload while securing reliability, that is, sameness between the crawling server's data and data from multiple remote sources. The proposed method performs the crawling process in response to events that request unit data or list data. The results show that the proposed method can reduce the network traffic overhead of existing web crawlers while securing data reliability. In the future, research on the convergence of event-based crawling and time-based crawling is required.
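The contrast the abstract draws, fetching on user events rather than polling on a timer, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class name and the simulated remote store are assumptions, and `fetch_remote` stands in for a real HTTP request.

```python
# Hypothetical sketch of event-driven crawling: the remote fetch runs only
# when a user event requests the data, never on a fixed polling schedule.

class EventDrivenCrawler:
    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote  # callable: key -> remote data
        self.cache = {}                   # local copy, re-synced on demand
        self.fetch_count = 0              # network requests actually issued

    def on_user_event(self, key):
        """Crawl the remote resource only because a user asked for it."""
        self.fetch_count += 1
        self.cache[key] = self.fetch_remote(key)
        return self.cache[key]

# Simulated remote server (no real network needed for the sketch).
remote = {"item/1": "v1", "list/news": ["a", "b"]}
crawler = EventDrivenCrawler(lambda k: remote[k])

crawler.on_user_event("item/1")   # one request, triggered by the event
remote["item/1"] = "v2"           # remote data changes...
stale = crawler.cache["item/1"]   # ...local copy is stale ("v1") until
fresh = crawler.on_user_event("item/1")  # the next event re-syncs it ("v2")
```

Between events the local copy may lag the remote, which is why the abstract points to combining event-based and time-based crawling as future work.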

Design and implementation of trend analysis system through deep learning transfer learning (딥러닝 전이학습을 이용한 경량 트렌드 분석 시스템 설계 및 구현)

  • Shin, Jongho;An, Suvin;Park, Taeyoung;Bang, Seungcheol;Noh, Giseop
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2022.10a
    • /
    • pp.87-89
    • /
    • 2022
  • Recently, as consumers spend more time at home due to COVID-19, time spent on digital consumption such as SNS and OTT services, which are easy to use non-face-to-face, has naturally increased. Since COVID-19 emerged in 2019, digital consumption has nearly doubled, from 44% to 82%, and it has become important to grasp and apply trends quickly and accurately by analyzing consumer sentiment amid these rapidly changing digital behaviors. However, there are practical limits to implementing sentiment-analysis services on small systems rather than large-scale systems, and few such services have actually been deployed. If even a small system could easily analyze consumer trends, it would be valuable in a rapidly changing society. In this paper, we propose a lightweight trend analysis system that builds a learning network through transfer learning (fine-tuning) of the BERT model and connects a crawler for real-time data collection.

  • PDF

Development of Web Crawler for Archiving Web Resources (웹 자원 아카이빙을 위한 웹 크롤러 연구 개발)

  • Kim, Kwang-Young;Lee, Won-Goo;Lee, Min-Ho;Yoon, Hwa-Mook;Shin, Sung-Ho
    • The Journal of the Korea Contents Association
    • /
    • v.11 no.9
    • /
    • pp.9-16
    • /
    • 2011
  • Once a web service is terminated and disappears, there is no way to collect, preserve, or utilize its web resources. These resources, regardless of their importance, are updated periodically or aperiodically, or are destroyed. Web archiving, which collects and preserves web resources, is therefore being emphasized, and a crawler that periodically collects web resources for archiving was required. In this study, we analyze the strengths and weaknesses of existing web crawlers used to collect web resources for archiving, and we have developed a web archiving system for optimal collection of web resources.

A Study on the Selection of Mobile Crane Model for Heavy Equipment Installation (중량물 설치 시 이동식 크레인 기종선정에 관한 연구)

  • Jeong, Jae-Bok;Yoo, Ho-Seon
    • Plant Journal
    • /
    • v.8 no.2
    • /
    • pp.59-69
    • /
    • 2012
  • This study focuses on avoiding failures caused by experience-based model selection when simulation programs are not available, and suggests methods for effectively selecting alternatives when the chosen model proves unsuitable for the original plan. First, DEMAG's CC8800-1K has the longest boom, 216 m at maximum. Boom combinations are feasible to the second level, except for the MANITOWOC M 2250 (M-1200 RINGER), which is possible to the third level. Second, the boom angle ranges from 20 to 82 degrees, with 55-78 degrees suitable for work. The working load applied to crawler and hydraulic cranes is 75-85% of the critical load capacity, and as the operating radius increases, the crawler type gains an advantage over the hydraulic type. Lastly, related problems were verified through case analysis of the proposed selection method. The major problems stem from selection based purely on experience, unreasonable demands on existing facilities, and repeated selections by designers who accumulate experience through the same or similar projects.

  • PDF
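The numeric screening criteria in the abstract (working boom angle of 55-78 degrees, working load at 75-85% of the critical load capacity) can be expressed as simple checks. The thresholds come from the abstract; the function names and example values are illustrative assumptions, not the study's procedure.

```python
def within_working_angle(angle_deg):
    # The study cites 55-78 degrees as the suitable working range of the
    # boom (the mechanical range being 20-82 degrees).
    return 55 <= angle_deg <= 78

def load_ratio_ok(working_load_t, rated_capacity_t):
    # The applied working load should fall at 75-85% of the critical
    # (rated) load capacity, per the abstract's selection criterion.
    ratio = working_load_t / rated_capacity_t
    return 0.75 <= ratio <= 0.85

# Hypothetical screening of a candidate crane configuration.
angle_ok = within_working_angle(60)      # True: inside 55-78 degrees
load_ok = load_ratio_ok(80, 100)         # True: 80% of rated capacity
overloaded = load_ratio_ok(95, 100)      # False: above the 85% ceiling
```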

A Study on Political Attitude Estimation of Korean OSN Users (온라인 소셜네트워크를 통한 한국인의 정치성향 예측 기법의 연구)

  • Wijaya, Muhammad Eka;Ahn, Heejune
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.21 no.4
    • /
    • pp.1-11
    • /
    • 2016
  • Recently, numerous studies have been conducted to estimate human personality from online social activities. This paper develops a comprehensive model for political attitude estimation leveraging users' Facebook Like information. We designed a Facebook crawler that efficiently collects data, overcoming the difficulties of crawling Ajax-enabled Facebook pages. We show that category-level selection can reduce data-analysis complexity by exploiting the sparsity of the huge Like-attitude matrix. For Korean Facebook users, only 28 criteria (3% of the total) can estimate a user's political polarity with high accuracy (AUC of 0.82).
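The category-level selection the abstract mentions amounts to collapsing a sparse page-level Like matrix into a much smaller set of category features. A minimal sketch, assuming an illustrative page-to-category mapping (the pages, categories, and function names here are invented, not the paper's data):

```python
# Hypothetical sketch: aggregate sparse page-level Likes into
# category-level features to reduce dimensionality before estimation.
page_category = {
    "NewsOutletA": "news", "NewsOutletB": "news",
    "PoliticianX": "politics", "PoliticianY": "politics",
}

def category_features(user_likes):
    """Count a user's Likes per category instead of per individual page."""
    feats = {}
    for page in user_likes:
        cat = page_category.get(page)
        if cat is not None:
            feats[cat] = feats.get(cat, 0) + 1
    return feats

feats = category_features({"NewsOutletA", "PoliticianX", "PoliticianY"})
# Thousands of page columns collapse to a handful of category counts,
# which is what makes a small criteria set (28 in the paper) workable.
```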

Modern Concurrent Programming for Multicore Environment (동시성으로 작성하는 파이썬 크롤러)

  • Kim, Nam-gue;Kang, Young-Jin;Lee, HoonJae
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2017.05a
    • /
    • pp.430-433
    • /
    • 2017
  • Programming that ensures concurrency is essential for developers: without it, program speed is unlikely to improve unless the hardware itself advances. Languages that support writing good concurrent code include Go, Elixir, and Scala. Python, which offers many useful libraries, also supports concurrent programming through asyncio and coroutines. This paper defines the concepts of concurrency and parallelism and explains what to note when writing concurrent programs in Python. A crawler that collects web data is written in concurrent code and compared with versions written in sequential and multithreaded code.

  • PDF
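The asyncio-and-coroutine approach the abstract compares against sequential code can be sketched in a few lines. This is a generic illustration, not the paper's crawler: the URLs are placeholders and the network request is simulated with `asyncio.sleep` so the sketch runs without a connection.

```python
import asyncio
import time

async def fetch(url):
    # Stand-in for an HTTP request: each "download" takes 0.1 s.
    await asyncio.sleep(0.1)
    return f"body of {url}"

async def crawl_concurrently(urls):
    # gather() runs the coroutines concurrently, so total time is roughly
    # one fetch, not len(urls) fetches as in the sequential version.
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(10)]
start = time.perf_counter()
pages = asyncio.run(crawl_concurrently(urls))
elapsed = time.perf_counter() - start
# Sequentially this would take ~1.0 s; concurrently it is close to 0.1 s.
```

Because crawling is I/O-bound, coroutines yield during each pending response, which is exactly where the speedup over sequential code comes from.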

Design and Implementation of Web Crawler utilizing Unstructured data

  • Tanvir, Ahmed Md.;Chung, Mokdong
    • Journal of Korea Multimedia Society
    • /
    • v.22 no.3
    • /
    • pp.374-385
    • /
    • 2019
  • A web crawler is a program commonly used by search engines to discover new content on the internet, and the use of crawlers has made the web easier for users. In this paper, we structure unstructured data in order to collect data from web pages. Our system can choose words near a given keyword across multiple documents in an unstructured way; neighboring data for the keyword were collected through word2vec. The system's goal is filtering at the data-acquisition level for a large taxonomy. The main problem in text taxonomy is how to improve classification accuracy. To improve accuracy, we propose a new TF-IDF weighting method and modify the TF algorithm to calculate accuracy on unstructured data. Finally, our system proposes a competent web-page search crawling algorithm, derived from TF-IDF and the RL web search algorithm, to enhance the efficiency of retrieving relevant information. This paper also examines the working of crawlers and crawling algorithms in search engines for efficient information retrieval.
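The abstract does not detail the proposed modification, but the standard TF-IDF baseline it starts from can be sketched for reference. Document contents and function names below are illustrative assumptions.

```python
import math

def tf_idf(docs):
    """Standard TF-IDF baseline: weight(t, d) = tf(t, d) * log(N / df(t)).
    Takes a list of tokenized documents, returns one weight dict per doc."""
    n = len(docs)
    df = {}  # document frequency: how many docs contain each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            tf = doc.count(term) / len(doc)       # normalized term frequency
            w[term] = tf * math.log(n / df[term])  # rarer terms weigh more
        weights.append(w)
    return weights

docs = [["web", "crawler", "web"], ["crawler", "search"], ["search", "engine"]]
w = tf_idf(docs)
# "web" occurs only in doc 0, so it gets a high weight there; "crawler"
# appears in two of three docs, so its idf (and weight) is lower.
```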

Asynchronous Web Crawling Algorithm (링크 분석을 통한 비동기 웹 페이지 크롤링 알고리즘)

  • Won, Dong-Hyun;Park, Hyuk-Gyu;Kang, Yun-Jeong;Lee, Min-Hye
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2022.10a
    • /
    • pp.364-366
    • /
    • 2022
  • The web uses asynchronous methods to provide various kinds of information with different processing speeds together. The asynchronous method has the advantage of responding to other events even before a task is completed, but a typical crawler, which collects only the information present at the time of its visit, has difficulty collecting information that is provided asynchronously. In addition, asynchronous web pages often keep the same web address even when the page content changes, making them difficult to crawl. In this paper, we propose a web crawling algorithm that accounts for asynchronous page transitions by analyzing the links in a page. With the proposed algorithm, it was possible to collect TTA terminology dictionary entries that are provided asynchronously.

  • PDF
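One way to read "analyzing links to handle asynchronous pages" is that the crawler must harvest not only ordinary `<a href>` links but also the endpoints the page requests after rendering. A minimal sketch under that assumption; the `fetch("...")` pattern and the sample page are invented stand-ins, not the paper's method.

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from ordinary <a> anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def extract_targets(html):
    # Ordinary links from the markup...
    parser = LinkExtractor()
    parser.feed(html)
    # ...plus endpoints requested asynchronously from inline script, which
    # a visit-time-only crawler would miss (the fetch("...") pattern here
    # is an illustrative assumption about how such calls might appear).
    async_endpoints = re.findall(r'fetch\("([^"]+)"\)', html)
    return parser.links + async_endpoints

page = '''<a href="/terms">terms</a>
<script>fetch("/api/terms?page=2")</script>'''
targets = extract_targets(page)  # both the anchor and the async endpoint
```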

Crepe Search System Design using Web Crawling (웹 크롤링 이용한 크레페 검색 시스템 설계)

  • Kim, Hyo-Jong;Han, Kun-Hee;Shin, Seung-Soo
    • Journal of Digital Convergence
    • /
    • v.15 no.11
    • /
    • pp.261-269
    • /
    • 2017
  • The purpose of this paper is to design a search system that guarantees up-to-date information by accessing the web in real time within a single network, without using a database server, rather than using multiple bots connected over a wide-area network. The research method is to design and analyze a system that can search people and keywords quickly and accurately in the Crepe system. When a user registers information on the Crepe server, each user applies different styles such as font, font size, and color, so the body-tag matching conversion process stores all the information as-is, and the Crepe server avoids body-tag matching problems. When running the Crepe search system, however, user styles and characteristics cannot be normalized. This problem can be solved using the html_img_parser function and the Go language's HTML parser package. By applying queues and multiple threads to a general-purpose web crawler, rather than designing a crawler that targets a specific site, it is possible to search and collect various websites quickly and efficiently in many applications.
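The queues-plus-threads pattern the abstract applies to its general-purpose crawler can be sketched as follows. This is a generic illustration, not the Crepe implementation: the worker count is arbitrary and `fetch` is a stand-in for the real HTTP request.

```python
import queue
import threading

def crawl_with_threads(urls, worker_count=4, fetch=lambda u: f"body:{u}"):
    """Drain a shared work queue with several threads so slow fetches
    overlap instead of running one after another."""
    work = queue.Queue()
    for u in urls:
        work.put(u)

    results = {}
    lock = threading.Lock()  # guard the shared results dict

    def worker():
        while True:
            try:
                url = work.get_nowait()  # take the next URL, if any remain
            except queue.Empty:
                return                   # queue drained: this worker exits
            body = fetch(url)
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

pages = crawl_with_threads([f"u{i}" for i in range(8)])  # 8 pages collected
```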