An Implementation and Performance Evaluation of Fast Web Crawler with Python

  • Received : 2019.09.26
  • Accepted : 2019.09.27
  • Published : 2019.09.30

Abstract

The Internet continues to expand rapidly, producing a vast number of web pages whose content changes dynamically. In particular, the fast development of wireless communication technology and the wide spread of smart devices allow information to be created quickly and changed anywhere, at any time. In this situation, web crawling, also known as web scraping, has come into broad use in many fields. A web crawler is an organized, automated computer system that systematically navigates pages residing on the web and automatically searches and indexes their information. This paper implements a prototype web crawler in Python and improves its execution speed by using threads on a multicore CPU. The implementation was verified by crawling reference web sites, and the performance improvement was confirmed by measuring execution speed under different thread configurations on a multicore CPU.
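The abstract does not show the crawler itself, so the following is only a minimal sketch of the general technique it names: a breadth-first crawler that fetches each level of pages with a pool of threads. It assumes the Python standard library only (`concurrent.futures`, `html.parser`, `urllib`) rather than a third-party parser. The names `LinkParser`, `fetch`, `safe_fetch`, `extract_links`, and `crawl`, and the parameters `max_pages` and `num_threads`, are hypothetical and not taken from the paper. Threads tend to help here because page downloads are I/O-bound, so workers spend most of their time waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url, timeout=10):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def safe_fetch(fetch_page, url):
    """Return the page text, or None if the download fails for any reason."""
    try:
        return fetch_page(url)
    except Exception:
        return None


def extract_links(base_url, html):
    """Return absolute http(s) links found in the page, in document order."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [u for u in absolute if u.startswith(("http://", "https://"))]


def crawl(seed_urls, max_pages=50, num_threads=4, fetch_page=fetch):
    """Breadth-first crawl; the pages of each level are fetched concurrently.

    Returns a dict mapping each successfully fetched URL to its outgoing links.
    """
    visited = set()
    results = {}
    frontier = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while frontier and len(visited) < max_pages:
            batch = frontier[: max_pages - len(visited)]
            visited.update(batch)
            # pool.map runs the downloads in worker threads and
            # yields the results in the same order as `batch`.
            pages = pool.map(lambda u: safe_fetch(fetch_page, u), batch)
            next_frontier = {}
            for url, html in zip(batch, pages):
                if html is None:
                    continue
                links = extract_links(url, html)
                results[url] = links
                for link in links:
                    if link not in visited:
                        next_frontier[link] = None
            frontier = list(next_frontier)
    return results
```

To compare thread configurations in the spirit of the evaluation described above, one could time `crawl(seeds, num_threads=n)` for several values of `n` with `time.perf_counter()`. The `fetch_page` parameter exists so that the download step can be replaced, for example with a stub that serves pages from a dictionary when testing without network access.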
