http://dx.doi.org/10.17661/jkiiect.2021.14.1.29

Study of Efficient Algorithm for Deduplication of Complex Structure  

Lee, Hyeopgeon (Dept. of Data Analysis, Seoul Gangseo Campus of Korea Polytechnics)
Kim, Young-Woon (Dept. of Software Engineering, Seoil University)
Kim, Ki-Young (Dept. of Software Engineering, Seoil University)
Publication Information
The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.14, no.1, 2021, pp. 29-36
Abstract
The amount of data generated has been growing exponentially, and its complexity has been increasing, owing to advances in information technology (IT). Big data analysts and engineers have therefore been actively researching ways to reduce the volume of data to be analyzed so that big data can be processed and analyzed faster. Hadoop, which is widely used as a big data platform, provides various processing and analysis functions, including reduction of the analysis targets through Hive, a subproject of Hadoop. However, Hive uses a vast amount of memory for data deduplication because it is implemented without considering the complexity of the data. This study therefore proposes an efficient algorithm for deduplicating data with complex structures. The performance evaluation showed that, compared to Hive, the proposed algorithm reduces memory usage by approximately 79% and data deduplication time by approximately 0.677%. Future work requires a performance evaluation on a large number of data nodes to verify the proposed algorithm under realistic conditions.
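The abstract does not describe the proposed algorithm itself, so the following is only a minimal illustrative sketch of one common way to deduplicate records with nested structure while keeping memory use small. It is not the authors' method and not how Hive works internally. The function names `canonical_digest` and `deduplicate` are hypothetical, and the sketch assumes records are JSON-serializable and that two records count as duplicates when their contents match regardless of key order.

```python
import hashlib
import json


def canonical_digest(record):
    """Return a fixed-size SHA-256 digest of a nested record.

    sort_keys=True makes the serialization independent of dict key order,
    so records with equal content produce the same digest.
    """
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).digest()


def deduplicate(records):
    """Yield each distinct record once, in first-seen order.

    Only 32-byte digests are held in memory, not the full nested records.
    (Assumption for illustration: digest collisions are ignored.)
    """
    seen = set()
    for record in records:
        digest = canonical_digest(record)
        if digest not in seen:
            seen.add(digest)
            yield record


if __name__ == "__main__":
    sample = [
        {"id": 1, "tags": ["a", "b"], "meta": {"x": 1, "y": 2}},
        # Same content as the first record, different key order: a duplicate.
        {"meta": {"y": 2, "x": 1}, "tags": ["a", "b"], "id": 1},
        # List order differs, so this record is treated as distinct.
        {"id": 1, "tags": ["b", "a"], "meta": {"x": 1, "y": 2}},
    ]
    for record in deduplicate(sample):
        print(record)
```

Storing a fixed-size digest per record, rather than the record itself, is one way the memory cost of deduplication can be made independent of how deeply nested each record is. Whether the paper's algorithm takes a similar approach cannot be determined from the abstract.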
Keywords
Big Data; Deduplication; Hadoop; MapReduce; Hive