Performance Comparison of Logistic Regression Algorithms on RHadoop

  • Jung, Byung Ho (Department of Information Statistics, Gyeongsang National University)
  • Lim, Dong Hoon (Department of Information Statistics, RINS, Gyeongsang National University)
  • Received : 2017.01.18
  • Accepted : 2017.04.13
  • Published : 2017.04.28

Abstract

Machine learning is now implemented and applied in many different domains of everyday life. Logistic regression is a classification method in machine learning and is widely used in many fields, including medicine, economics, marketing and the social sciences. In this paper, we present MapReduce implementations of three existing algorithms for logistic regression, namely the Gradient Descent algorithm, the Cost Minimization algorithm and the Newton-Raphson algorithm, on RHadoop, which integrates the R and Hadoop environments and is applicable to large-scale data. We compare the performance of these algorithms in estimating logistic regression coefficients on real and simulated data sets. We also compare the performance of the RHadoop and RHIPE platforms. The experiments showed that our Newton-Raphson algorithm outperformed the Gradient Descent and Cost Minimization algorithms on all data sets tested, and that RHadoop outperformed RHIPE on the real data, while the opposite held on the simulated data.
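As a rough illustration of how such estimators fit the MapReduce pattern, the following sketch computes partial gradients and Hessians of the log-likelihood on each data split (map), sums them (reduce), and applies either a Newton-Raphson or a Gradient Descent update in the driver. It is not the authors' RHadoop code: it is plain Python running in a single process, and every name in it (`map_chunk`, `reduce_parts`, `make_chunks`, the learning rate, the simulated data) is an assumption made for illustration. The Cost Minimization algorithm is omitted because the abstract does not define it.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_chunks(n=400, n_chunks=4, seed=0, true_beta=(-0.5, 1.0)):
    # Hypothetical simulated data: intercept plus one Gaussian covariate,
    # split into n_chunks pieces that stand in for distributed data splits.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = (1.0, rng.gauss(0.0, 1.0))
        prob = sigmoid(sum(b * xi for b, xi in zip(true_beta, x)))
        rows.append((x, 1 if rng.random() < prob else 0))
    return [rows[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk, beta):
    # Map step: partial score vector and information matrix for one split.
    p = len(beta)
    grad = [0.0] * p
    hess = [[0.0] * p for _ in range(p)]
    for x, y in chunk:
        mu = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        w = mu * (1.0 - mu)
        for j in range(p):
            grad[j] += (y - mu) * x[j]
            for k in range(p):
                hess[j][k] += w * x[j] * x[k]
    return grad, hess

def reduce_parts(parts):
    # Reduce step: element-wise sum of the partial results.
    p = len(parts[0][0])
    grad = [sum(g[j] for g, _ in parts) for j in range(p)]
    hess = [[sum(h[j][k] for _, h in parts) for k in range(p)] for j in range(p)]
    return grad, hess

def solve(a, b):
    # Solve a x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def newton_raphson(chunks, p, iters=25, tol=1e-8):
    # Driver: beta <- beta + H^{-1} g, one map/reduce pass per iteration.
    beta = [0.0] * p
    for _ in range(iters):
        grad, hess = reduce_parts([map_chunk(c, beta) for c in chunks])
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def gradient_descent(chunks, p, lr=0.5, iters=1000):
    # Driver: beta <- beta + lr * g / n (ascent on the log-likelihood,
    # equivalently descent on the negative log-likelihood).
    n = sum(len(c) for c in chunks)
    beta = [0.0] * p
    for _ in range(iters):
        grad, _ = reduce_parts([map_chunk(c, beta) for c in chunks])
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta
```

Calling `newton_raphson(make_chunks(), 2)` and `gradient_descent(make_chunks(), 2)` should give essentially the same coefficients, with the Newton-Raphson driver needing far fewer map/reduce passes. In a real Hadoop job every pass carries scheduling and I/O overhead, so the number of passes matters more than it does in this single-process sketch.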
