http://dx.doi.org/10.5351/KJAS.2014.27.7.1171

Rhipe Platform for Big Data Processing and Analysis  

Jung, Byung Ho (Department of Information Statistics, Gyeongsang National University)
Shin, Ji Eun (Department of Information Statistics, Gyeongsang National University)
Lim, Dong Hoon (Department of Information Statistics, Gyeongsang National University)
Publication Information
The Korean Journal of Applied Statistics, v.27, no.7, 2014, pp. 1171-1185
Abstract
Rhipe, which integrates R with the Hadoop environment, makes it possible to process and analyze massive amounts of data in a distributed processing environment. In this paper, we implemented multiple regression analysis using Rhipe on actual and simulated data of various sizes. Experiments comparing the computing speeds of the pseudo-distributed and fully-distributed modes of a Hadoop cluster showed that the fully-distributed mode was faster than the pseudo-distributed mode, and that the fully-distributed mode became faster as the number of data nodes increased. We also compared the performance of Rhipe with that of the stats package and of the biglm package available for bigmemory. The results showed that Rhipe was faster than the other packages, owing to parallel processing in which the number of map tasks increases with the size of the data.
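As an illustration of how a multiple regression can be split across map tasks, the following is a minimal sketch in plain Python. It is not Rhipe code and is not taken from the paper. It assumes the common approach in which each map task computes the partial sums X'X and X'y for its chunk of rows, a reduce step adds the partial sums, and the normal equations are solved once at the end. All function names (`map_chunk`, `reduce_pair`, `solve`, `fit`) and the toy data are hypothetical.

```python
# Sketch only: chunked least squares in the MapReduce style.
# Not Rhipe and not the paper's code; all names here are hypothetical.
from functools import reduce


def map_chunk(rows):
    """Map step: return partial (X'X, X'y) for one chunk of (x_vector, y) rows."""
    p = len(rows[0][0]) + 1  # +1 for the intercept column
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        xi = [1.0] + list(x)  # prepend the intercept term
        for a in range(p):
            xty[a] += xi[a] * y
            for b in range(p):
                xtx[a][b] += xi[a] * xi[b]
    return xtx, xty


def reduce_pair(left, right):
    """Reduce step: add two partial (X'X, X'y) results element-wise."""
    xtx = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(left[0], right[0])]
    xty = [u + v for u, v in zip(left[1], right[1])]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta


def fit(chunks):
    """Run the map step on every chunk, combine the results, then solve."""
    xtx, xty = reduce(reduce_pair, map(map_chunk, chunks))
    return solve(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2, split into two chunks.
data = [((x1, x2), 1 + 2 * x1 + 3 * x2) for x1 in range(4) for x2 in range(3)]
chunks = [data[:6], data[6:]]
beta = fit(chunks)
```

On the toy data, `beta` should be close to the generating coefficients (1, 2, 3). Because each chunk contributes only a small matrix and vector regardless of how many rows it holds, adding more chunks (and hence more map tasks) does not increase the amount of data passed to the final solve, which is one plausible reason this kind of computation parallelizes well.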
Keywords
Big data; R; Hadoop; Rhipe; multiple regression analysis
Citations & Related Records
Times cited by KSCI: 1
1 Kane, M. J. and Emerson, J. W. (2010a). bigmemory: Manage massive matrices with shared memory and memory-mapped files, R package version 4.2.3.
2 Kane, M. J. and Emerson, J. W. (2010b). biganalytics: A library of utilities for big.matrix objects of package bigmemory, R package version 1.0.12.
3 Laney, D. (2001). 3D Data Management: Controlling Data Volume, Velocity, and Variety, META Group.
4 Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute.
5 Prajapati, V. (2013). Big data analytics with R and Hadoop, Packt Publishing Ltd, Birmingham, UK.
6 Sammer, E. (2012). Hadoop Operations, O'Reilly Media, Inc, Sebastopol, CA.
7 White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media, Inc, Sebastopol, CA.
8 Ko, Y. J. and Kim, J. S. (2013). Big data processing and analysis using Rhipe, Journal of the Korean Data and Information Science Society, 24(5), 975-987.
9 ASA data expo. (2009). http://stat-computing.org/dataexpo/2009/the-data.html
10 Adler, D., Nenadic, O., Zucchini, W. and Glaser, C. (2007). The ff package: Handling large data sets in R with memory mapped pages of binary flat files, UseR2007, http://www.r-project.org/conferences/useR-2007/program/presentations/adler.pdf
11 Ciliendo, E., Kunimasa, T. and Braswell, B. (2007). Linux Performance and Tuning Guidelines, IBM.
12 Guha, S. (2010). Computing environment for the statistical analysis of large and complex data. PhD thesis, Department of Statistics, Purdue University, West Lafayette.
13 Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B. and Cleveland, W. S. (2012). Large complex data: divide and recombine (D&R) with RHIPE, Stat, 1(1), 53-67.
14 Hafen, R., Gibson, T., Dam, K. K. and Critchlow, T. (2014). Power grid data analysis with R and Hadoop, in Data Mining Applications with R, pp. 1-34.
15 Lin, H., Yang, S. and Midkiff, S. P. (2013). A Parallel R Framework for Processing Large Dataset on Distributed Systems, DataCloud.