http://dx.doi.org/10.22156/CS4SMB.2019.9.1.045

Design and Implementation of a Real-time Web Crawling Distributed Monitoring System

Kim, Yeong-A (Department of Computer Science & Engineering, GNTECH)
Kim, Gea-Hee (Department of Computer Science & Engineering, GNTECH)
Kim, Hyun-Ju (Department of Computer Science & Engineering, GNTECH)
Kim, Chang-Geun (Department of Computer Science & Engineering, GNTECH)
Publication Information
Journal of Convergence for Information Technology / v.9, no.1, 2019, pp. 45-53
Abstract
In this rapidly changing information era, websites serve an excessive amount of information; much of it is of little use, and selecting the information that is actually needed takes considerable time. Many websites, including search engines, use web crawling to keep their data up to date. Web crawling is typically used to generate copies of all the pages of visited sites, which search engines then index for faster searching. For collecting wholesale and order information that changes in real time, however, keyword-oriented web data collection is not adequate, and no alternative for the selective, real-time collection of web information has been proposed. In this paper, we propose a method of collecting information from restricted websites using a real-time Web Crawling distributed Monitoring System (R-WCMS), estimating collection time through detailed analysis of the data, and storing the results in a parallel system. Experimental results show that applying the proposed model to website information retrieval reduces collection time by 15-17%.
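To make the kind of pipeline described above concrete, the sketch below shows, in Python, how a crawler worker might fetch pages from a small set of monitored sites and publish the extracted data to a Kafka topic for downstream processing (e.g., by Spark) and parallel storage. This is a minimal illustration under stated assumptions, not the paper's R-WCMS implementation: the broker address localhost:9092, the topic name crawl-pages, and the target URL list are hypothetical values chosen for the example.

# Minimal sketch: crawl a few monitored pages and publish the results to a
# Kafka topic so a downstream consumer (e.g., Spark) can process and store
# them in parallel. Broker address, topic name, and URLs are illustrative
# assumptions, not values from the paper.
import json
import time

import requests
from bs4 import BeautifulSoup          # pip install beautifulsoup4
from kafka import KafkaProducer        # pip install kafka-python

MONITORED_URLS = [                     # hypothetical restricted sites to watch
    "https://example.com/wholesale/prices",
    "https://example.com/orders/today",
]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",            # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def crawl_once(url):
    """Fetch one page and return the fields of interest."""
    start = time.time()
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else "",
        "text": soup.get_text(separator=" ", strip=True),
        "fetched_at": time.time(),
        "fetch_seconds": time.time() - start,      # rough per-page collection time
    }

if __name__ == "__main__":
    for url in MONITORED_URLS:
        record = crawl_once(url)
        # Publish to the topic that the Spark/Hadoop side would consume from.
        producer.send("crawl-pages", value=record)
    producer.flush()                               # ensure all records are delivered

On the consuming side, a Spark streaming job reading from the same topic could parse these JSON records and write them to HDFS in parallel, which corresponds to the roles the abstract assigns to Kafka, Spark, and Hadoop.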
Keywords
Web Crawling; Big Data; Hadoop; Spark; Kafka; Parallel Systems; Monitoring