http://dx.doi.org/10.7465/jkdi.2016.27.3.609

RHadoop platform for K-Means clustering of big data  

Shin, Ji Eun (Department of Information and Statistics, Gyeongsang National University)
Oh, Yoon Sik (Division of Biological Sciences, Gyeongsang National University)
Lim, Dong Hoon (Department of Information and Statistics, Gyeongsang National University)
Publication Information
Journal of the Korean Data and Information Science Society / v.27, no.3, 2016, pp. 609-619
Abstract
RHadoop is a collection of R packages that allows users to manage and analyze data with Hadoop. In this paper, we implement the K-Means algorithm on the MapReduce framework with RHadoop so that the clustering method can be applied to large-scale data. The main idea is to apply a combiner to the map output, which reduces the amount of data the reducers have to process. We show that our K-Means algorithm using RHadoop with a combiner becomes faster than the regular algorithm without a combiner as the size of the data set increases. We also implement the Elbow method with MapReduce to find the optimum number of clusters for K-Means clustering on a large data set. On small data, our MapReduce implementation of the Elbow method and the classical kmeans() in R gave similar results.
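The role of the combiner can be illustrated with a small in-memory sketch. It is not the authors' RHadoop code: it is plain Python that imitates the map, combine, and reduce steps of one K-Means iteration, and every name in it (`map_phase`, `combine`, `reduce_phase`, `kmeans_iteration`, `wss`) as well as the toy data are assumptions made for illustration. Each input split stands in for the data seen by one mapper. Because a centroid is a sum divided by a count, each mapper can pre-aggregate its output into one partial (sum, count) record per cluster, so fewer records reach the reducers while the resulting centers stay the same.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def map_phase(split, centers):
    """Map: emit (cluster index, (point, 1)) for every point in one split."""
    def nearest(p):
        return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
    return [(nearest(p), (p, 1)) for p in split]

def combine(pairs):
    """Combiner: collapse records into one partial (sum, count) per cluster."""
    acc = {}
    for k, (vec, n) in pairs:
        if k in acc:
            s, m = acc[k]
            acc[k] = (tuple(a + b for a, b in zip(s, vec)), m + n)
        else:
            acc[k] = (tuple(vec), n)
    return list(acc.items())

def reduce_phase(pairs, centers):
    """Reduce: merge the partial sums per cluster and divide by the count."""
    new_centers = list(centers)  # a cluster with no points keeps its old center
    for k, (s, n) in combine(pairs):
        new_centers[k] = tuple(x / n for x in s)
    return new_centers

def kmeans_iteration(splits, centers, use_combiner=True):
    """One iteration; also returns how many records reach the reducers."""
    shuffled = []
    for split in splits:  # each split plays the role of one mapper
        out = map_phase(split, centers)
        shuffled.extend(combine(out) if use_combiner else out)
    return reduce_phase(shuffled, centers), len(shuffled)

def wss(splits, centers):
    """Total within-cluster sum of squares, the quantity the Elbow method plots."""
    return sum(min(sq_dist(p, c) for c in centers)
               for split in splits for p in split)

# Toy data (assumed): two splits of three 2-D points, two initial centers.
splits = [[(1.0, 1.0), (1.0, 2.0), (9.0, 9.0)],
          [(2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]]
initial = [(0.0, 0.0), (10.0, 10.0)]

with_comb, n_with = kmeans_iteration(splits, initial, use_combiner=True)
without_comb, n_without = kmeans_iteration(splits, initial, use_combiner=False)
total_wss = wss(splits, with_comb)
print(n_with, n_without, with_comb == without_comb, total_wss)
```

In this toy run the combiner sends at most one record per cluster per split to the reducers instead of one record per point. The within-cluster sum of squares is also a plain sum over points, so the same map, combine, and reduce pattern could in principle be repeated for each candidate number of clusters to build an Elbow curve.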
Keywords
Big data; Hadoop; K-Means clustering; R; RHadoop