
Web Crawler Improvement and Dynamic Process Design and Implementation for Effective Data Collection


  • Wang, Tae-su (Department of Computer Engineering, Dong-Eui University) ;
  • Song, JaeBaek (Department of Computer Engineering, Dong-Eui University) ;
  • Son, Dayeon (Department of Computer Engineering, Dong-Eui University) ;
  • Kim, Minyoung (Research Institute of ICT Fusion and Convergence, Dong-Eui University) ;
  • Choi, Donggyu (Institute of Smart IT, Dong-Eui University) ;
  • Jang, Jongwook (Department of Computer Engineering, Dong-Eui University)
  • Received : 2022.10.31
  • Accepted : 2022.11.08
  • Published : 2022.11.30

Abstract

Recently, the growing diversity and use of information have produced vast amounts of data, increasing the importance of big data analysis for collecting, storing, processing, and predicting data, along with the ability to gather only the information that is actually needed. More than half of the web consists of text, and a great deal of data is generated through the organic interaction of users. Crawling is a representative technique for collecting such text data, but many crawlers are developed with a focus only on how to obtain the data, without consideration for web servers or their administrators. In this paper, we examine the problems that can arise during the crawling process and the precautions that should be taken, and we design and implement an improved dynamic web crawler that fetches data efficiently. The crawler, which addresses the shortcomings of existing crawlers, is designed as a multi-process system and reduced the required working time by roughly a factor of four on average.
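For illustration only, the sketch below shows the general idea described in the abstract: distributing URLs across worker processes while remaining considerate to the target web server (an identifying User-Agent and a pause between requests). It is a minimal, hypothetical example using Python's multiprocessing and the requests library, not the authors' implementation, which handles dynamically rendered pages; all names, URLs, and parameters here are assumptions.

```python
# Minimal multi-process crawling sketch (hypothetical, not the paper's code).
# URLs are split across worker processes; each worker identifies itself and
# sleeps between requests so the web server is not overloaded.

import time
from multiprocessing import Pool

import requests  # third-party library: pip install requests

HEADERS = {"User-Agent": "polite-research-crawler/0.1 (contact: example@example.com)"}
REQUEST_DELAY_SEC = 1.0   # pause between requests inside one worker
NUM_PROCESSES = 4         # number of parallel worker processes


def fetch(url: str):
    """Fetch one page and return (url, status code, body or error message)."""
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        result = (url, resp.status_code, resp.text)
    except requests.RequestException as exc:
        result = (url, -1, str(exc))
    time.sleep(REQUEST_DELAY_SEC)  # be considerate to the server
    return result


if __name__ == "__main__":
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]
    # Each worker process handles a share of the URL list in parallel.
    with Pool(processes=NUM_PROCESSES) as pool:
        for url, status, body in pool.imap_unordered(fetch, urls):
            print(url, status, len(body))
```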


Keywords

Acknowledgement

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Grand Information Technology Research Center support program (IITP-2022-2016-0-00318) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation). This research was also supported by BB21plus, funded by Busan Metropolitan City and the Busan Institute for Talent & Lifelong Education (BIT). Finally, we thank Ms. Bomin Kim (20180020@office.deu.ac.kr) for proofreading this paper.
