http://dx.doi.org/10.9728/dcs.2017.18.5.933

Web Crawler Service Implementation for Information Retrieval based on Big Data Analysis  

Kim, Hye-Suk (Department of Electronics and Computer Engineering, Chonnam National University)
Han, Na (Department of Electronics and Computer Engineering, Chonnam National University)
Lim, Suk-Ja (Department of Advertisement Design, Gwangju Campus of Korea Polytechnic)
Publication Information
Journal of Digital Contents Society / v.18, no.5, 2017, pp. 933-942
Abstract
In this paper, we propose a web crawler service for efficiently collecting information on extracurricular activities, competitions, and scholarships for college students and job seekers. The proposed service uses Jsoup tree analysis and JSON-format data transmission to avoid duplicate crawling while crawling at high speed. After collecting relevant information for 24 hours, we confirmed that the web crawler service ran with an accuracy of 100%. We expect that the service can be improved in the future by applying it to a wider variety of web sites.
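The abstract names three ingredients: parsing each page into a tree, skipping items that were already crawled, and shipping results as JSON. The sketch below illustrates that combination only in outline. It is not the authors' implementation: the paper uses Jsoup (a Java library), whereas this example substitutes Python's standard-library `html.parser`, and the names `LinkTitleParser`, `crawl_page`, the sample markup, and the `example.org` URL are all hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTitleParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl_page(html, base_url, seen):
    """Parse one page and return only the postings whose URL is new."""
    parser = LinkTitleParser()
    parser.feed(html)
    fresh = []
    for href, title in parser.items:
        url = urljoin(base_url, href)
        if url in seen:
            continue  # duplicate: already collected on this or an earlier pass
        seen.add(url)
        fresh.append({"url": url, "title": title})
    return fresh


# Hypothetical listing page with one repeated posting.
page = (
    '<ul>'
    '<li><a href="/contest/1">Design Contest</a></li>'
    '<li><a href="/scholarship/7">Merit Scholarship</a></li>'
    '<li><a href="/contest/1">Design Contest</a></li>'
    '</ul>'
)

seen_urls = set()
first = crawl_page(page, "https://example.org", seen_urls)
second = crawl_page(page, "https://example.org", seen_urls)

# JSON payload that a client of the crawler service would receive.
payload = json.dumps(first, ensure_ascii=False)
```

Keying the `seen` set on the absolute URL is one simple way to avoid re-collecting a posting across repeated crawls; a real service would likely persist that set and might key on a content hash instead.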
Keywords
Big Data; Web Crawler; Web Crawling; Filter; Search
Citations & Related Records
Times Cited By KSCI: 8