http://dx.doi.org/10.3837/tiis.2022.11.009

An Analytic solution for the Hadoop Configuration Combinatorial Puzzle based on General Factorial Design

Priya, R. Sathia (Ramanujan Computing Centre, College of Engineering, Anna University; Department of Computer Science and Engineering, Loyola-ICAM College of Engineering and Technology)
Prakash, A. John (Ramanujan Computing Centre, College of Engineering, Anna University)
Uthariaraj, V. Rhymend (Ramanujan Computing Centre, College of Engineering, Anna University)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.16, no.11, 2022, pp. 3619-3637

Abstract

Big data analytics offers endless opportunities for operational enhancement by extracting valuable insights from complex, voluminous data. Hadoop is a comprehensive technological suite that offers solutions for the large-scale storage and computing needs of Big data. The performance of Hadoop is closely tied to its configuration settings, which depend on the cluster capacity and the application profile. Since Hadoop has over 190 configuration parameters, tuning them for optimal application performance is a daunting challenge. Our approach is to extract a subset of impactful parameters, from which a performance-enhancing sub-optimal configuration is then narrowed down. This paper presents a statistical model to analyze the significance of the effect of Hadoop parameters on a variety of performance metrics. Our model decomposes the total observed performance variation and ascribes it to the main parameters, their interaction effects and noise factors. The method clearly segregates impactful parameters from the rest. The configuration setting determined by our methodology reduced the job completion time by 22%, memory and CPU utilization by 15% and 12% respectively, the number of killed Maps by 50% and disk spillage by 23%. The proposed technique can be leveraged to ease the configuration tuning task of any Hadoop cluster, regardless of differences in the underlying infrastructure and the application running on it.
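
The decomposition described in the abstract can be illustrated with a minimal sketch of a balanced two-factor general factorial analysis. This is not the authors' implementation: the function name `factorial_anova`, the two unnamed factors A and B (standing in for any two configuration parameters), and the job completion times are all assumptions made up for illustration. The sketch splits the total sum of squares into the two main effects, their interaction, and the error (noise) term, then reports each term's share of the total variation and its F ratio.

```python
from statistics import mean


def factorial_anova(y):
    """Decompose variation in a balanced two-factor factorial experiment.

    y[i][j] is the list of replicate measurements taken with factor A at
    level i and factor B at level j. Every cell must hold the same number
    of replicates (at least two, so the error term can be estimated).
    """
    a, b, n = len(y), len(y[0]), len(y[0][0])

    grand = mean(v for row in y for cell in row for v in cell)
    a_means = [mean(v for cell in row for v in cell) for row in y]
    b_means = [mean(v for row in y for v in row[j]) for j in range(b)]
    cell_means = [[mean(cell) for cell in row] for row in y]

    # Sums of squares for main effects, interaction, error and total.
    ss_a = b * n * sum((m - grand) ** 2 for m in a_means)
    ss_b = a * n * sum((m - grand) ** 2 for m in b_means)
    ss_ab = n * sum(
        (cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
        for i in range(a)
        for j in range(b)
    )
    ss_e = sum(
        (v - cell_means[i][j]) ** 2
        for i in range(a)
        for j in range(b)
        for v in y[i][j]
    )
    ss_t = sum((v - grand) ** 2 for row in y for cell in row for v in cell)

    # Mean square error uses a*b*(n-1) degrees of freedom.
    ms_e = ss_e / (a * b * (n - 1))
    dof = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1)}
    ss = {"A": ss_a, "B": ss_b, "AB": ss_ab}

    result = {
        name: {
            "ss": ss[name],
            "share": ss[name] / ss_t,
            "F": (ss[name] / dof[name]) / ms_e,
        }
        for name in ss
    }
    result["error"] = {"ss": ss_e, "share": ss_e / ss_t}
    result["total"] = {"ss": ss_t}
    return result


# Made-up job completion times in seconds: 3 levels of factor A,
# 2 levels of factor B, 2 replicate runs per combination.
times = [
    [[120.0, 124.0], [110.0, 114.0]],
    [[100.0, 104.0], [96.0, 100.0]],
    [[98.0, 102.0], [80.0, 84.0]],
]

for name, row in factorial_anova(times).items():
    if "share" in row:
        print(f"{name}: {100 * row['share']:.1f}% of total variation")
```

A term with a large share of the total variation and a large F ratio would be a candidate impactful parameter under this kind of analysis; judging significance would additionally require comparing each F ratio against the F distribution for its degrees of freedom, which this sketch omits.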

Keywords

ANOVA; Big data; Configuration tuning; General Factorial Design; Hadoop; Impactful parameters
