CORE-Dedup: IO Extent Chunking based Deduplication using Content-Preserving Access Locality

  • Kim, Myung-Sik (김명식) (Dept. of Electronics and Computer Engineering, Hanyang University) ;
  • Won, You-Jip (원유집) (Dept. of Computer Science, Hanyang University)
  • Received : 2015.01.06
  • Accepted : 2015.06.08
  • Published : 2015.06.30

Abstract

The recent spread of embedded devices and the growth of broadband communication technology have led to a rapid increase in the volume of data that is created and managed. As a result, data centers must expand storage capacity cost-effectively to store this data. Data deduplication saves storage space by removing redundant data. This work proposes CORE-Dedup, an IO extent based deduplication scheme that exploits content-preserving access locality. We acquire IO traces from the block device layer of a virtual machine host and compare the deduplication performance of fixed-size chunking and IO extent based chunking. For a merged workload of 10 users' compile jobs in a virtual machine environment, 4 KB fixed-size chunking uses 14,500 chunk index entries and IO extent based chunking uses 1,700. The deduplication rates are 60.4% and 57.6% for fixed-size and IO extent chunking, respectively.

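The popularization of high-performance embedded devices and advances in broadband communication technology are increasing the amount of data that is created and managed. Deduplication identifies redundant write requests and stores only unique data, which saves storage space and makes it possible to build storage and processing systems for rapidly growing data economically. This work proposes CORE-Dedup, which chunks data by IO extent (the size of each IO request). Extent-based chunking in CORE-Dedup exploits the property that accessed content preserves its access unit. We collected IO traces in a virtual machine and compared the deduplication performance of fixed-size chunking and the new extent chunking. For a workload of the same size, extent chunking achieves a similar level of duplicate detection with a smaller index buffer than 4 KB fixed-size chunking. In particular, for a workload that assumes similar, redundant IO accesses from many users, the wide workload coverage of extent-based chunking lets CORE-Dedup reach similar deduplication performance with roughly 1/10 of the buffer used by Inline-Dedup with fixed-size chunking under the same conditions. In a merged workload that assumes identical compile IO from 10 users, 4 KB fixed-size chunking reached a maximum duplicate detection rate of 60.4% with 14,500 chunk index entries, whereas extent chunking reached 57.6% with only 1,700 index entries.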
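
The trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the trace format (a list of write payloads), the SHA-1 fingerprint, the `dedup` helper, and the definition of deduplication rate as duplicate bytes over total bytes are all assumptions made for illustration. It replays a hypothetical trace under 4 KB fixed-size chunking and under extent chunking, where each IO request is one chunk, and reports the number of index entries and the duplicate ratio for each.

```python
import hashlib

BLOCK = 4096  # 4 KB fixed chunk size, as in the abstract


def fixed_chunks(payload, size=BLOCK):
    """Split one IO request's payload into fixed-size chunks."""
    return [payload[i:i + size] for i in range(0, len(payload), size)]


def extent_chunks(payload):
    """Treat the whole IO request (its extent) as a single chunk."""
    return [payload]


def dedup(trace, chunker):
    """Replay a trace and return (index_entries, duplicate_byte_ratio).

    The ratio is an assumed metric: duplicate bytes / total bytes written.
    """
    index = set()  # fingerprints of chunks seen so far
    total = dup = 0
    for payload in trace:
        for chunk in chunker(payload):
            fp = hashlib.sha1(chunk).digest()
            total += len(chunk)
            if fp in index:
                dup += len(chunk)
            else:
                index.add(fp)
    return len(index), (dup / total if total else 0.0)


def block(tag):
    """Hypothetical 4 KB block whose content is determined by `tag`."""
    return bytes([tag]) * BLOCK


# Hypothetical trace: two users issue the same 16 KB and 8 KB writes,
# followed by one request that shares only its first block with earlier data.
user_io = [block(1) + block(2) + block(3) + block(4), block(5) + block(6)]
trace = user_io + user_io + [block(1) + block(7)]

fixed_entries, fixed_rate = dedup(trace, fixed_chunks)
extent_entries, extent_rate = dedup(trace, extent_chunks)
print(f"fixed 4 KB: {fixed_entries} index entries, rate {fixed_rate:.3f}")
print(f"extent:     {extent_entries} index entries, rate {extent_rate:.3f}")
```

In this toy trace, extent chunking needs fewer index entries because repeated requests match as whole units, but it misses the partially overlapping last request that fixed-size chunking catches. That mirrors the direction of the reported result, a much smaller index at a slightly lower duplicate detection rate, without reproducing the paper's numbers.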
