http://dx.doi.org/10.22156/CS4SMB.2020.10.10.001

Design and Implementation of Multiple Filter Distributed Deduplication System Applying Cuckoo Filter Similarity  

Kim, Yeong-A (Data HRD Headquarters, EN-CORE.Co.,Ltd.)
Kim, Gea-Hee (Department of Computer Science & Engineering, GNTECH)
Kim, Hyun-Ju (Department of Computer Science & Engineering, GNTECH)
Kim, Chang-Geun (Department of Computer Science & Engineering, GNTECH)
Publication Information
Journal of Convergence for Information Technology / v.10, no.10, 2020, pp. 1-8
Abstract
The need for techniques to store, manage, and retrieve alternative data has grown in recent years, as technologies built on data generated by enterprise business activities have become key to business success. To process unstructured data, which is one form of alternative data, existing big data platform systems must load large volumes of data generated in real time without delay, and must manage storage space efficiently by applying deduplication across different storage systems when redundant data occurs. In this paper, we propose a multi-layer distributed data deduplication system that uses similarity computed with the Cuckoo hashing filter technique and takes the characteristics of big data into account. Cuckoo hashing is applied to measure the similarity between virtual machines, so that individual storage nodes can improve performance through more efficient deduplication, and a multi-layer Cuckoo filter is applied to reduce processing time. Experimental results show that the proposed method shortens processing time by 8.9% and increases the deduplication rate by 10.3%.
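To make the filter-based duplicate check concrete, the following is a minimal sketch of a single-layer Cuckoo filter used to skip chunks that have probably been stored already. It is an illustration under stated assumptions, not the system proposed in the paper: the class name `CuckooFilter`, the helper `deduplicate`, the 16-bit fingerprint, the bucket size, and the use of SHA-256 are all hypothetical choices, and the multi-layer arrangement and the virtual machine similarity step are not modeled.

```python
import hashlib
import random


class CuckooFilter:
    """Single-layer cuckoo filter holding short fingerprints (illustrative only)."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # Assumption: a power-of-two bucket count keeps the XOR mapping reversible.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: bytes) -> int:
        # 16-bit fingerprint, forced to be non-zero.
        return (self._hash(b"fp" + item) & 0xFFFF) or 1

    def _alt_index(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the other bucket depends only on
        # the current bucket and the fingerprint.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def _indexes(self, item: bytes, fp: int):
        i1 = self._hash(item) % self.num_buckets
        return i1, self._alt_index(i1, fp)

    def contains(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def insert(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full


def deduplicate(chunks, cuckoo):
    """Return the chunks that the filter had not seen before."""
    unique = []
    for chunk in chunks:
        if cuckoo.contains(chunk):
            continue  # probably a duplicate; false positives are possible
        cuckoo.insert(chunk)
        unique.append(chunk)
    return unique


if __name__ == "__main__":
    data = [b"block-A", b"block-B", b"block-A", b"block-C", b"block-B"]
    print(deduplicate(data, CuckooFilter()))
```

Because the filter keeps only fingerprints, a positive answer means "probably seen before". A real deduplication system would typically confirm a hit against a full chunk hash before discarding data.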
Keywords
Distributed Deduplication; Big Data; Cuckoo Hash; Multilayer Cuckoo Filter; Software Storage;