http://dx.doi.org/10.3837/tiis.2021.04.009

Dynamic Prime Chunking Algorithm for Data Deduplication in Cloud Storage  

Ellappan, Manogar (Department of Information Science and Technology, College of Engineering Anna University)
Abirami, S (Department of Information Science and Technology, College of Engineering Anna University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.15, no.4, 2021, pp. 1342-1359
Abstract
Data deduplication identifies duplicate data and minimizes redundant storage on the backup server. Chunk-level deduplication depends on detecting appropriate chunk boundaries, which addresses challenges such as low throughput and high chunk size variance in the data stream. To address these challenges, we propose a new chunking algorithm called Dynamic Prime Chunking (DPC). The main goal of DPC is to dynamically change the window size within the prime value based on the minimum and maximum chunk size. The results show that DPC provides high throughput and avoids significant chunk size variance in the deduplication system. DPC was implemented and evaluated on multimedia and operating system datasets, and compared with existing algorithms such as Rabin, TTTD, MAXP, and AE. The performance metrics analyzed in our work are chunk count, chunking time, throughput, processing time, Bytes Saved per Second (BSPS), and Deduplication Elimination Ratio (DER). The analysis shows that throughput and BSPS have improved. First, DPC improves throughput by more than 21% over AE. Second, BSPS increases by up to 11% over the existing AE algorithm. As a result, our algorithm reduces the total processing time and achieves higher deduplication efficiency than the existing Content Defined Chunking (CDC) algorithms.
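The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes the definitions commonly used in the CDC literature: throughput as input bytes per second of chunking time, DER as input bytes divided by bytes stored after deduplication, and BSPS as bytes saved per second of total processing time. The names `DedupMetrics` and `evaluate` are hypothetical, and the chunks may come from any chunker (Rabin, AE, DPC, or a fixed-size splitter).

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DedupMetrics:
    chunk_count: int
    throughput: float  # input bytes per second of chunking time (assumed definition)
    der: float         # input bytes / bytes stored after deduplication (assumed definition)
    bsps: float        # bytes saved per second of total processing time (assumed definition)


def evaluate(chunks, chunking_seconds, processing_seconds):
    """Compute deduplication metrics for a non-empty list of bytes chunks."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    total = sum(len(c) for c in chunks)
    seen = set()   # fingerprints of chunks already stored
    stored = 0     # bytes that must actually be kept
    for c in chunks:
        fingerprint = hashlib.sha1(c).digest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            stored += len(c)
    return DedupMetrics(
        chunk_count=len(chunks),
        throughput=total / chunking_seconds,
        der=total / stored,
        bsps=(total - stored) / processing_seconds,
    )


# Toy input: the third chunk duplicates the first, so 4 of 14 bytes are saved.
chunks = [b"aaaa", b"bbbb", b"aaaa", b"cc"]
metrics = evaluate(chunks, chunking_seconds=0.5, processing_seconds=2.0)
print(metrics)
```

Under these assumed definitions, a faster chunker raises throughput directly, while BSPS improves only if the time saved is not offset by a loss in duplicate detection.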
Keywords
Content Defined Chunking; Dynamic Prime Chunking; Cloud Storage; Data Deduplication; Performance Evaluation; Throughput