http://dx.doi.org/10.13067/JKIECS.2022.17.1.161

Distributed Processing of Big Data Analysis based on R using SparkR  

Ryu, Woo-Seok (Dept. of Health Care Management, Catholic University of Pusan)
Publication Information
The Journal of the Korea Institute of Electronic Communication Sciences / v.17, no.1, 2022, pp. 161-166
Abstract
In this paper, we analyze the problems that arise when R is used as a tool for big data analysis, and we show the usefulness of data analysis with SparkR, which connects R to Spark to support distributed processing of big data. First, we examine the memory allocation problem that R encounters when loading and operating on large amounts of data, along with the characteristics and programming environment of SparkR. We then compare execution performance when linear regression analysis is performed in each environment. The results show that R can be used for data analysis through SparkR without learning an additional language, and that code written in R is processed effectively in a distributed manner as the number of nodes in the cluster increases.
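The abstract itself contains no code. As a rough illustration of why a linear regression lends itself to distributed processing, the sketch below computes partial sums on each data partition, merges them, and solves for the slope and intercept from the merged totals. It is plain Python, not SparkR, and it is not the paper's implementation; the function names and the sample data are hypothetical, and the source does not say that Spark fits regressions in exactly this way.

```python
from functools import reduce

def partition_stats(rows):
    """Sums one partition contributes to a simple least-squares fit:
    (n, sum x, sum y, sum x*y, sum x*x)."""
    n = sx = sy = sxy = sxx = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    return (n, sx, sy, sxy, sxx)

def merge(a, b):
    """Combine the statistics of two partitions element-wise."""
    return tuple(u + v for u, v in zip(a, b))

def fit(partitions):
    """Fit y = slope * x + intercept from merged per-partition sums."""
    n, sx, sy, sxy, sxx = reduce(merge, map(partition_stats, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1, split into three
# partitions that stand in for three cluster nodes.
data = [(float(x), 2.0 * x + 1.0) for x in range(12)]
partitions = [data[0:4], data[4:8], data[8:12]]

slope, intercept = fit(partitions)
single_slope, single_intercept = fit([data])  # same data, one partition
print(slope, intercept)
```

Because each partition only contributes a handful of sums, the fit over three partitions matches the fit over the unsplit data, which is the property that lets the work be divided among nodes.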
Keywords
Big Data; Data Science; Distributed Processing; Programming Language; SparkR