http://dx.doi.org/10.6109/jkiice.2017.21.1.17

Design and Implementation of a Search Engine based on Apache Spark  

Park, Ki-Sung (Graduate School of Software, Soongsil University)
Choi, Jae-Hyun (Graduate School of Software, Soongsil University)
Kim, Jong-Bae (Graduate School of Software, Soongsil University)
Park, Jae-Won (Graduate School of Software, Soongsil University)
Abstract
Research on data has grown rapidly in recent years as data has become more valuable. Web crawlers, programs that collect data, have recently drawn attention because they can be applied in a variety of fields. A web crawler can be defined as a tool that traverses web servers in an automated manner, analyzes web pages, and collects URLs. For big data processing, distributed web crawlers based on Hadoop MapReduce are widely used, but they are difficult to use and have performance constraints. Apache Spark, an in-memory computing platform, is an alternative to MapReduce. A search engine, one of the main applications of a web crawler, displays the information gathered by the crawler that matches a keyword search. If a search engine uses a Spark-based web crawler instead of a traditional MapReduce-based one, data collection is expected to be faster.
Keywords
Search Engine; Crawler; Nutch; Spark; Solr
Citations & Related Records
Times Cited by KSCI: 4