http://dx.doi.org/10.7465/jkdi.2016.27.4.911

Learning algorithms for big data logistic regression on RHIPE platform  

Jung, Byung Ho (Department of Information and Statistics, Gyeongsang National University)
Lim, Dong Hoon (Department of Information and Statistics, Gyeongsang National University)
Publication Information
Journal of the Korean Data and Information Science Society / v.27, no.4, 2016, pp. 911-923
Abstract
Machine learning is becoming increasingly important in the big data era. Logistic regression is a classification method in machine learning and has been widely used in various fields, including medicine, economics, marketing, and the social sciences. Rhipe, which integrates R with the Hadoop environment, has received little attention from researchers owing to the difficulty of installing it and of implementing MapReduce programs with it. In this paper, we present MapReduce implementations of the gradient descent and Newton-Raphson algorithms for logistic regression using Rhipe. The Newton-Raphson algorithm does not require a learning rate, whereas the gradient descent algorithm requires a learning rate to be chosen manually. We choose the learning rate with a mixed procedure of grid search and binary search so that big data can be processed efficiently. In the performance study, our Newton-Raphson algorithm outperforms the gradient descent algorithm on all the tested data.
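As a rough illustration of the two update rules compared in the abstract, the following Python sketch mimics a MapReduce pass with plain functions: each "map" call computes a partial gradient and Hessian of the negative log-likelihood on one data chunk, and the "reduce" step sums them. This is not the authors' Rhipe code and uses no Rhipe or Hadoop API. The function names, the synthetic data, the fixed learning rate of 0.5, and the iteration counts are all assumptions made for the example, and the paper's grid/binary search for the learning rate is not reproduced.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def map_chunk(chunk, beta):
    # "Map": partial gradient and Hessian of the negative log-likelihood.
    p = len(beta)
    grad = [0.0] * p
    hess = [[0.0] * p for _ in range(p)]
    for x, y in chunk:
        mu = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        w = mu * (1.0 - mu)
        for j in range(p):
            grad[j] += (mu - y) * x[j]
            for k in range(p):
                hess[j][k] += w * x[j] * x[k]
    return grad, hess


def reduce_parts(parts):
    # "Reduce": sum the per-chunk gradients and Hessians.
    p = len(parts[0][0])
    grad = [sum(g[j] for g, _ in parts) for j in range(p)]
    hess = [[sum(h[j][k] for _, h in parts) for k in range(p)] for j in range(p)]
    return grad, hess


def solve(a, b):
    # Solve a @ x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit(chunks, p, method, rate=0.5, iters=10):
    # One simulated map/reduce pass over all chunks per iteration.
    n = sum(len(c) for c in chunks)
    beta = [0.0] * p
    for _ in range(iters):
        grad, hess = reduce_parts([map_chunk(c, beta) for c in chunks])
        if method == "newton":
            step = solve(hess, grad)            # no learning rate needed
        else:
            step = [rate * g / n for g in grad]  # hand-picked learning rate
        beta = [b - s for b, s in zip(beta, step)]
    return beta


def make_chunks(n_chunks=4, chunk_size=200, seed=0):
    # Synthetic data (assumption): intercept -0.5, slope 1.0.
    rng = random.Random(seed)
    chunks = []
    for _ in range(n_chunks):
        chunk = []
        for _ in range(chunk_size):
            x1 = rng.gauss(0.0, 1.0)
            y = 1 if rng.random() < sigmoid(-0.5 + 1.0 * x1) else 0
            chunk.append(([1.0, x1], y))
        chunks.append(chunk)
    return chunks


chunks = make_chunks()
beta_newton = fit(chunks, p=2, method="newton", iters=10)
beta_gd = fit(chunks, p=2, method="gd", rate=0.5, iters=400)
print("Newton-Raphson:", beta_newton)
print("Gradient descent:", beta_gd)
```

In this toy setting both methods approach the same estimate, but gradient descent needs many more passes over the data, and each pass corresponds to one MapReduce job. That per-pass cost is one plausible reason a method needing fewer iterations can be attractive on a Hadoop-based platform.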
Keywords
Big data; Hadoop; logistic regression; R; RHIPE