http://dx.doi.org/10.9717/kmms.2018.21.2.199

Design and Implementation of a Web Crawler System for Collection of Structured and Unstructured Data  

Bae, Seong Won (Dept. of Division of Computer Engineering, Dongseo University)
Lee, Hyun Dong (Industry Academy Cooperation Foundation, Dongseo University)
Cho, DaeSoo (Dept. of Division of Computer Engineering, Dongseo University)
Abstract
Consumer services such as low-priced shopping, customized advertising, and product recommendation increasingly rely on big data. As big data has grown in importance, so has the web crawler that collects data from the web. Existing web crawlers, however, have two problems. First, when a link does not expose its URL, the target page cannot be reached by URL. Second, they fetch more data than the user wants, which is inefficient. In this paper, we therefore use Casper.js, which can control the DOM in a headless browser, to generate DOM events that reach the pages behind such hidden links. We also propose an intelligent web crawler system in which users define steps that fine-tune the collection of both structured and unstructured data, so that only the data they want is retrieved. Finally, we show the superiority of the proposed crawler system through a performance evaluation that compares it with an existing web crawler.
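The first problem can be illustrated with a minimal sketch. It is not the authors' implementation (the paper drives a headless browser with Casper.js); it only shows, using Python's standard `html.parser`, how a crawler might separate links it can fetch by URL from links whose target is reachable only by firing a DOM event. The class name `LinkClassifier`, the function `classify_links`, and the heuristic (an `href` of `#` or `javascript:`, or an `onclick` handler, marks a hidden link) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Hypothetical helper: split <a> tags into direct and hidden links."""

    def __init__(self):
        super().__init__()
        self.direct = []  # hrefs a URL-only crawler can follow
        self.hidden = []  # handlers that need a DOM event in a browser

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = (attr.get("href") or "").strip()
        is_script = href.lower().startswith("javascript:")
        if href and not href.startswith("#") and not is_script:
            # A real URL is present, so a plain HTTP fetch is enough.
            self.direct.append(href)
        elif is_script or attr.get("onclick"):
            # Assumed heuristic: the target is produced by script, so a
            # headless browser would have to trigger the event.
            self.hidden.append(attr.get("onclick") or href)


def classify_links(html):
    """Return (direct_urls, hidden_handlers) found in an HTML string."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.direct, parser.hidden
```

In a system like the one described, the links in the second list would be handed to the headless browser, which clicks the element and reads the page that results.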
Keywords
Web Crawler System; Cluster; Headless Browser Testing Framework; Unstructured Data
Citations & Related Records
Times Cited By KSCI: 2