http://dx.doi.org/10.6109/jkiice.2022.26.11.1729

Web Crawler Improvement and Dynamic Process Design and Implementation for Effective Data Collection

Wang, Tae-su (Department of Computer Engineering, Dong-Eui University)
Song, JaeBaek (Department of Computer Engineering, Dong-Eui University)
Son, Dayeon (Department of Computer Engineering, Dong-Eui University)
Kim, Minyoung (Research Institute of ICT Fusion and Convergence, Dong-Eui University)
Choi, Donggyu (Institute of Smart IT, Dong-Eui University)
Jang, Jongwook (Department of Computer Engineering, Dong-Eui University)
Abstract
Recently, the growing diversity and use of information has generated large volumes of data. This has raised the importance of big data analysis, which collects, stores, and processes data and makes predictions from it, and has made the ability to collect only the necessary information essential. More than half of the web consists of text, and much of that data is generated through the organic interaction of users. Crawling is a representative technique for collecting text data, but many crawlers are developed with a focus only on obtaining data, without consideration for web servers or their administrators. In this paper, we examine the problems that can occur during crawling and the precautions that should be taken, and we design and implement an improved dynamic web crawler that fetches data efficiently. The crawler, which addresses the problems of the existing crawler, was designed to run as multiple processes and reduced the working time by a factor of four on average.
Keywords
Web crawling; Multiprocessing; Data mining; Big data
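
The abstract describes a crawler that runs as multiple processes while taking the web server into account. The following Python sketch illustrates that general idea only; it is not the authors' implementation. It assumes a static fetch with `urllib` as a stand-in for the paper's dynamic crawler, and the names `USER_AGENT`, `DELAY_SECONDS`, `build_robot_rules`, `allowed_urls`, `fetch`, and `crawl` are hypothetical. Checking `robots.txt` and pausing before each request are assumed examples of server-friendly behavior, not measures confirmed by the paper.

```python
import time
from multiprocessing import Pool
from urllib import robotparser
from urllib.request import Request, urlopen

USER_AGENT = "example-crawler"  # hypothetical crawler name (assumption)
DELAY_SECONDS = 1.0             # assumed pause before each request, per worker


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule object."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(urls, rules):
    """Keep only the URLs that the robots.txt rules permit for USER_AGENT."""
    return [url for url in urls if rules.can_fetch(USER_AGENT, url)]


def fetch(url):
    """Download one page in a worker process and return (url, byte count)."""
    time.sleep(DELAY_SECONDS)  # each worker waits, so the delay is per process
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=10) as response:
        return url, len(response.read())


def crawl(urls, rules, processes=4):
    """Fetch the permitted URLs in parallel using a pool of worker processes."""
    targets = allowed_urls(urls, rules)
    with Pool(processes=processes) as pool:
        return pool.map(fetch, targets)
```

To try it, call `crawl(urls, rules)` from inside an `if __name__ == "__main__":` block, which `multiprocessing` requires on platforms that start workers by spawning a new interpreter. Because the delay applies per worker, raising `processes` also raises the total request rate seen by the server, so the two settings should be tuned together.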