[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.17661/jkiiect.2021.14.1.29

Study of Efficient Algorithm for Deduplication of Complex Structure

Lee, Hyeopgeon (Dept. of Data Analysis, Seoul Ganseo Campus of Korea Polytechnics)
Kim, Young-Woon (Dept. of Software Engineering, Seoil University)
Kim, Ki-Young (Dept. of Software Engineering, Seoil University)

Publication Information

The Journal of Korea Institute of Information, Electronics, and Communication Technology, v.14, no.1, 2021, pp. 29-36

Abstract

The amount of data generated has been growing exponentially, and advances in information technology (IT) have made that data increasingly complex. Big data analysts and engineers have therefore been actively researching ways to minimize the analysis targets so that big data can be processed and analyzed faster. Hadoop, which is widely used as a big data platform, provides various processing and analysis functions, including minimization of analysis targets through Hive, a subproject of Hadoop. However, Hive uses a vast amount of memory for data deduplication because its implementation does not take the complexity of the data into account. This paper therefore proposes an efficient algorithm for deduplicating data with complex structures. The performance evaluation shows that, compared to Hive, the proposed algorithm reduces memory usage by approximately 79% and data deduplication time by approximately 0.677%. For a realistic verification of the proposed algorithm, future work requires a performance evaluation on a large number of data nodes.
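
The abstract does not describe the proposed algorithm itself, so the following is only a minimal sketch of one generic way to deduplicate records with nested (complex) structure. It is not the authors' method. It assumes records can be represented as JSON-like Python objects. The names `fingerprint` and `deduplicate` are hypothetical. The idea shown is that keeping a fixed-size hash per record, rather than the full nested record, bounds the memory needed to detect duplicates.

```python
import hashlib
import json
from typing import Any, Iterable, Iterator


def fingerprint(record: Any) -> bytes:
    """Return a 32-byte SHA-256 digest of a canonical form of the record.

    Assumption: the record is JSON-serializable. Sorting keys makes two
    dicts with the same content but different key order hash identically.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).digest()


def deduplicate(records: Iterable[Any]) -> Iterator[Any]:
    """Yield each distinct record once, in first-seen order.

    Only the fixed-size fingerprints are retained, not the records.
    """
    seen = set()
    for record in records:
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            yield record


# Illustrative data (made up for this sketch, not from the paper).
records = [
    {"id": 1, "tags": ["a", "b"], "meta": {"x": 1, "y": 2}},
    {"meta": {"y": 2, "x": 1}, "tags": ["a", "b"], "id": 1},  # same content, different key order
    {"id": 1, "tags": ["b", "a"], "meta": {"x": 1, "y": 2}},  # list order differs, so treated as distinct
]
unique = list(deduplicate(records))
```

In this sketch list order is treated as significant while dictionary key order is not; whether that matches a given dataset's notion of "duplicate" is a design choice that would need to be settled per schema.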

Keywords

Big Data; Deduplication; Hadoop; MapReduce; Hive

