http://dx.doi.org/10.9708/jksci.2015.20.6.059

CORE-Dedup: IO Extent Chunking based Deduplication using Content-Preserving Access Locality  

Kim, Myung-Sik (Dept. of Electronics and Computer Engineering, Hanyang University)
Won, You-Jip (Dept. of Computer Science, Hanyang University)
Abstract
The recent proliferation of embedded devices and the growth of broadband communication have led to a rapid increase in the volume of data that is created and managed. As a result, data centers must expand storage capacity cost-effectively to hold this data. Data deduplication is one way to save storage space by removing redundant data. This work proposes an IO extent based deduplication scheme called CORE-Dedup that exploits content-preserving access locality. We acquire IO traces from the block device layer of a virtual machine host and compare the deduplication performance of fixed-size chunking and IO extent based chunking. Under a multi-user workload in which 10 users compile in a virtual machine environment, 4 KB fixed-size chunking and IO extent based chunking use a chunk index of 14500 and 1700, respectively. The deduplication rate is 60.4% for fixed-size chunking and 57.6% for IO extent chunking.
Keywords
Data deduplication; CORE-Dedup; IO Extent Chunking; Content-Preserving Access Locality
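
The trade-off reported in the abstract (a much smaller chunk index at a slightly lower deduplication rate) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the synthetic trace, the SHA-1 fingerprint, and the names `fixed_chunks`, `extent_chunks`, and `dedup_stats` are all assumptions made for illustration, and the sketch simply treats each whole write request as one extent without modeling the content-preserving access locality that CORE-Dedup exploits.

```python
import hashlib

BLOCK = 4096  # 4 KB, the fixed chunk size named in the abstract


def fixed_chunks(payload, size=BLOCK):
    """Split one write payload into fixed-size chunks."""
    return [payload[i:i + size] for i in range(0, len(payload), size)]


def extent_chunks(payload):
    """Treat the whole IO request (extent) as a single chunk (assumption)."""
    return [payload]


def dedup_stats(trace, chunker):
    """Return (index_entries, dedup_rate) for a list of write payloads."""
    index = set()  # fingerprints of chunks already stored
    total = 0      # bytes written by the workload
    stored = 0     # bytes kept after deduplication
    for payload in trace:
        for chunk in chunker(payload):
            total += len(chunk)
            fp = hashlib.sha1(chunk).digest()
            if fp not in index:
                index.add(fp)
                stored += len(chunk)
    rate = 1 - stored / total if total else 0.0
    return len(index), rate


# Hypothetical trace: three 16 KB writes; the third differs in its last block.
a, b, c, d, e = (bytes([x]) * BLOCK for x in b"ABCDE")
trace = [a + b + c + d, a + b + c + d, a + b + c + e]

for name, chunker in (("fixed 4 KB", fixed_chunks), ("IO extent", extent_chunks)):
    entries, rate = dedup_stats(trace, chunker)
    print(f"{name}: {entries} index entries, dedup rate {rate:.1%}")
```

On this toy trace, fixed-size chunking keeps 5 index entries and removes 58.3% of the written bytes, while extent chunking keeps 2 entries and removes 33.3%. The gap between the two rates is far larger here than in the measured workload; the example only shows the direction of the trade-off, not its magnitude.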