http://dx.doi.org/10.7465/jkdi.2017.28.4.755

Comparison analysis of big data integration models  

Jung, Byung Ho (Gyeongsangnamdo Provincial Government)
Lim, Dong Hoon (Department of Information and Statistics, Gyeongsang National University)
Publication Information
Journal of the Korean Data and Information Science Society / v.28, no.4, 2017, pp. 755-768
Abstract
As big data becomes central to the fourth industrial revolution, big data processing and analysis capabilities are expected to influence companies' future competitiveness. RHadoop and RHIPE, which integrate R with the Hadoop environment, have each been discussed separately, but few studies have compared them. In this paper, we constructed the RHadoop and RHIPE big data platforms, applicable to large-scale data, and implemented machine learning algorithms such as multiple regression and logistic regression on the MapReduce framework. We studied the performance and scalability of these implementations for various sample sizes of actual and simulated data. The experiments demonstrated that our RHadoop and RHIPE platforms scale well and efficiently process large data sets on commodity hardware. RHIPE was faster than RHadoop on almost all of the data sets.
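The abstract does not say how multiple regression was expressed in MapReduce form. One common formulation, shown here only as an illustrative assumption and not as the authors' implementation, has each map task compute the partial sums X'X and X'y for its chunk, a reduce step add them, and a final step solve the normal equations. The sketch below uses plain Python rather than RHadoop or RHIPE; the function names (`map_chunk`, `reduce_stats`, `solve`, `fit_linear_regression`) and the toy data are hypothetical.

```python
from functools import reduce


def map_chunk(rows):
    """Map step: turn one chunk of (x_vector, y) rows into partial X'X and X'y."""
    p = len(rows[0][0]) + 1  # +1 column for the intercept
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        xi = [1.0] + list(x)  # prepend the intercept term
        for a in range(p):
            xty[a] += xi[a] * y
            for b in range(p):
                xtx[a][b] += xi[a] * xi[b]
    return xtx, xty


def reduce_stats(left, right):
    """Reduce step: add two partial (X'X, X'y) pairs element-wise."""
    xtx = [[l + r for l, r in zip(lrow, rrow)]
           for lrow, rrow in zip(left[0], right[0])]
    xty = [l + r for l, r in zip(left[1], right[1])]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def fit_linear_regression(chunks):
    """Combine per-chunk statistics, then solve the normal equations once."""
    xtx, xty = reduce(reduce_stats, map(map_chunk, chunks))
    return solve(xtx, xty)


# Toy data (assumed for illustration), generated exactly from
# y = 1 + 2*x1 + 3*x2 and split into two chunks as if stored in two blocks.
chunks = [
    [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((0.0, 1.0), 4.0)],
    [((1.0, 1.0), 6.0), ((2.0, 1.0), 8.0), ((1.0, 2.0), 9.0)],
]
beta = fit_linear_regression(chunks)
print(beta)  # intercept first, then one coefficient per predictor
```

Because only the small p-by-p and p-by-1 summaries leave each chunk, the amount of data moved between map and reduce does not grow with the number of rows. Logistic regression has no closed-form solution, so a MapReduce version would typically repeat a similar map-and-sum pass once per iteration of an iterative fitting method.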
Keywords
Big data; multiple regression; logistic regression; RHadoop; RHIPE