[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.6109/jkiice.2017.21.1.17

Design and Implementation of a Search Engine based on Apache Spark

Park, Ki-Sung (Graduate School of Software, Soongsil University)
Choi, Jae-Hyun (Graduate School of Software, Soongsil University)
Kim, Jong-Bae (Graduate School of Software, Soongsil University)
Park, Jae-Won (Graduate School of Software, Soongsil University)

Publication Information

Journal of the Korea Institute of Information and Communication Engineering, v.21, no.1, 2017, pp. 17-28

Abstract

Research on data has become increasingly active as the value of data has grown. Web crawlers, programs that collect data, have recently drawn attention because they can be applied in a variety of fields. A web crawler can be defined as a tool that traverses web servers in an automated manner, analyzes web pages, and collects URLs. For big data processing, distributed web crawlers based on Hadoop MapReduce are widely used, but this approach is difficult to use and has performance constraints. Apache Spark, an in-memory computing platform, is an alternative to MapReduce. A search engine, one of the main applications of a web crawler, returns the information gathered by the crawler that matches a keyword query. If a search engine uses a Spark-based web crawler instead of a traditional MapReduce-based one, data collection is expected to be faster.
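The crawler definition in the abstract (traverse pages automatically, analyze each page, collect URLs) can be illustrated with a minimal single-process sketch. This is not the paper's Spark- or Nutch-based implementation; it only shows the basic traversal loop. The names `LinkExtractor`, `crawl`, and `PAGES` are hypothetical, and the in-memory `PAGES` dictionary stands in for real HTTP fetching so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal from seed; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched; skip it
        order.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "web" (assumption: replaces real HTTP requests).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="b.html">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/c#top">C</a>',
    "http://example.test/b.html": "<p>no links</p>",
    "http://example.test/c": '<a href="/a">A</a>',
}

visited = crawl("http://example.test/", PAGES.get)
print(visited)
```

In a distributed crawler of the kind the abstract discusses, the fetch-and-parse step for each batch of URLs would be spread across worker nodes; the sketch above keeps everything in one process to make the traversal logic visible.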

Keywords

Search Engine; Crawler; Nutch; Spark; Solr

