[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.13067/JKIECS.2021.16.1.19

Performance Factor of Distributed Processing of Machine Learning using Spark

Ryu, Woo-Seok (Dept. of Health Care Management, Catholic University of Pusan)

Publication Information

The Journal of the Korea Institute of Electronic Communication Sciences / v.16, no.1, 2021, pp. 19-24

Abstract

In this paper, we study the performance factors of machine learning in a distributed environment using Apache Spark and present an efficient distributed processing method through experiments. We first classify the performance factors of machine learning on a distributed cluster into cluster performance, data size, and the configuration of the Spark engine. We then evaluate the performance of regression analysis using Spark MLlib running on a Hadoop cluster while varying the node configuration and the Spark Executor configuration. The experiments confirm that the effective number of executors is affected by the number of data blocks but, depending on the cluster size, its maximum and minimum values are limited by the number of cores and the number of worker nodes, respectively.
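
The relationship stated at the end of the abstract can be illustrated with a small sketch. It is only one reading of that sentence, not the paper's method: it assumes the effective executor count tracks the number of data blocks, with the worker-node count as the lower bound and the total core count as the upper bound. The function names and the 128 MiB block size (a common HDFS default) are assumptions introduced here for illustration.

```python
# Hypothetical helpers; names and the clamp rule are illustrative assumptions,
# not taken from the paper.

DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size in bytes


def num_blocks(data_size_bytes, block_size_bytes=DEFAULT_BLOCK_SIZE):
    """Return how many blocks a file of the given size occupies (at least 1)."""
    # Integer ceiling division avoids floating-point rounding.
    return max(1, -(-data_size_bytes // block_size_bytes))


def effective_executors(blocks, worker_nodes, total_cores):
    """Clamp the block count between worker_nodes (min) and total_cores (max)."""
    return max(worker_nodes, min(blocks, total_cores))


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed example cluster: 3 worker nodes with 12 cores in total.
    for size in (GIB // 10, 1 * GIB, 10 * GIB):
        blocks = num_blocks(size)
        print(size, blocks, effective_executors(blocks, worker_nodes=3, total_cores=12))
```

Under these assumptions, a small input that fits in one block still suggests one executor per worker node, while a large input is capped by the number of cores available in the cluster.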

Keywords

Spark; Cluster; Machine Learning; Distributed Processing

