[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.7465/jkdi.2016.27.3.609

RHadoop platform for K-Means clustering of big data

Shin, Ji Eun (Department of Information and Statistics, Gyeongsang National University)
Oh, Yoon Sik (Division of Biological Sciences, Gyeongsang National University)
Lim, Dong Hoon (Department of Information and Statistics, Gyeongsang National University)

Publication Information

Journal of the Korean Data and Information Science Society, v.27, no.3, 2016, pp. 609-619

Abstract

RHadoop is a collection of R packages that allow users to manage and analyze data with Hadoop. In this paper, we implement the K-Means algorithm on the MapReduce framework with RHadoop to make the clustering method applicable to large-scale data. The main idea is to introduce a combiner that operates on the map output and reduces the amount of data the reducers must process. We show that our K-Means algorithm using RHadoop with a combiner was faster than the regular algorithm without a combiner as the size of the data set increased. We also implement the Elbow method with MapReduce to find the optimal number of clusters for K-Means clustering on large data sets. On small data, our MapReduce implementation of the Elbow method and the classical kmeans() in R gave similar results.
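The role of the combiner described above can be illustrated with a small sketch. This is not the authors' RHadoop code and does not use any RHadoop API: it is a plain Python simulation of one K-Means iteration, and all function names (`map_phase`, `combine`, `reduce_phase`, `kmeans_iteration`) and the sample points are assumptions made for illustration. Each mapper assigns its points to the nearest center, the combiner collapses that mapper's output to one partial (sum, count) per cluster, and the reducer merges the partial results into new centers.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def map_phase(split, centers):
    """Map: emit (cluster_id, (point, 1)) for every point in one input split."""
    return [(nearest(p, centers), (p, 1)) for p in split]

def combine(pairs):
    """Combine: collapse records to one (vector_sum, count) per cluster id."""
    acc = {}
    for cid, (vec, n) in pairs:
        if cid in acc:
            s, c = acc[cid]
            acc[cid] = ([a + b for a, b in zip(s, vec)], c + n)
        else:
            acc[cid] = (list(vec), n)
    return list(acc.items())

def reduce_phase(pairs, centers):
    """Reduce: merge partial sums, then divide by counts to get new centers."""
    merged = dict(combine(pairs))
    new_centers = []
    for j, old in enumerate(centers):
        if j in merged:
            s, c = merged[j]
            new_centers.append([x / c for x in s])
        else:
            new_centers.append(list(old))  # empty cluster: keep the old center
    return new_centers

def kmeans_iteration(splits, centers, use_combiner=True):
    """Run one iteration; also return how many records reach the reducer."""
    shuffled = []
    for split in splits:  # each split stands in for one mapper's input
        out = map_phase(split, centers)
        shuffled.extend(combine(out) if use_combiner else out)
    return reduce_phase(shuffled, centers), len(shuffled)

# Hypothetical toy data: two splits of 2-D points and two initial centers.
splits = [[(1.0, 1.0), (2.0, 0.0), (9.0, 9.0)],
          [(0.0, 2.0), (10.0, 8.0), (8.0, 10.0)]]
centers = [[0.0, 0.0], [10.0, 10.0]]

with_c, sent_with = kmeans_iteration(splits, centers, use_combiner=True)
without_c, sent_without = kmeans_iteration(splits, centers, use_combiner=False)
print(with_c, sent_with)        # [[1.0, 1.0], [9.0, 9.0]] 4
print(without_c, sent_without)  # [[1.0, 1.0], [9.0, 9.0]] 6
```

Both variants produce the same centers, because sums and counts can be merged in any order. With the combiner, the number of records sent to the reducer is at most the number of mappers times the number of clusters, whereas without it that number grows with the number of data points.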

Keywords

Big data; Hadoop; K-Means clustering; R; RHadoop
