Title/Summary/Keyword: Distributed Deduplication

Distributed data deduplication technique using similarity based clustering and multi-layer bloom filter (SDS 환경의 유사도 기반 클러스터링 및 다중 계층 블룸필터를 활용한 분산 중복제거 기법)

  • Yoon, Dabin; Kim, Deok-Hwan
    • The Journal of Korean Institute of Next Generation Computing / v.14 no.5 / pp.60-70 / 2018
  • Software-defined storage (SDS) is being deployed in cloud environments to let multiple users virtualize physical servers, but a solution is needed to optimize space efficiency under limited physical resources. Conventional deduplication systems have difficulty removing redundant data uploaded to distributed storages. In this paper, we propose a distributed deduplication method using similarity-based clustering and a multi-layer Bloom filter. A Rabin hash is applied to measure the similarity between virtual machine servers and to cluster similar virtual machines, which improves deduplication efficiency compared to deduplicating each storage node individually. In addition, a multi-layer Bloom filter is incorporated into the deduplication process to shorten processing time by reducing the number of false positives. Experimental results show that the proposed method improves the deduplication ratio by 9% compared to a deduplication method using IP-address-based clusters, with no difference in processing time.
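
The abstract describes the multi-layer Bloom filter only at a high level. As a rough illustration of the general idea, here is a minimal Python sketch (all class names and parameters are illustrative, not from the paper): a chunk fingerprint counts as a duplicate only if every independently salted layer reports it, so the overall false-positive rate shrinks roughly to the product of the per-layer rates.

```python
import hashlib

class BloomFilter:
    """A plain Bloom filter over a fixed-size bit array."""
    def __init__(self, size_bits=1 << 16, num_hashes=4, salt=b""):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.salt = salt
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from one salted digest.
        digest = hashlib.sha256(self.salt + item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

class MultiLayerBloomFilter:
    """Stacked, independently salted Bloom filters: a fingerprint is
    reported as seen only if every layer agrees, which multiplies
    down the false-positive rate."""
    def __init__(self, num_layers=3):
        self.layers = [BloomFilter(salt=bytes([i])) for i in range(num_layers)]

    def add(self, fingerprint: bytes):
        for layer in self.layers:
            layer.add(fingerprint)

    def probably_seen(self, fingerprint: bytes) -> bool:
        return all(fingerprint in layer for layer in self.layers)

mlbf = MultiLayerBloomFilter()
mlbf.add(b"chunk-fingerprint-1")
assert mlbf.probably_seen(b"chunk-fingerprint-1")   # duplicate detected
print(mlbf.probably_seen(b"chunk-fingerprint-2"))   # almost surely False
```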

Design and Implementation of Multiple Filter Distributed Deduplication System Applying Cuckoo Filter Similarity (쿠쿠 필터 유사도를 적용한 다중 필터 분산 중복 제거 시스템 설계 및 구현)

  • Kim, Yeong-A; Kim, Gea-Hee; Kim, Hyun-Ju; Kim, Chang-Geun
    • Journal of Convergence for Information Technology / v.10 no.10 / pp.1-8 / 2020
  • As technologies built on data generated by business activities have become key to business success, the need for techniques to store, manage, and retrieve such data has emerged. Big data platforms must ingest large volumes of data generated in real time without delay in order to process unstructured data, and must manage storage space efficiently by deduplicating redundant data across different storages. In this paper, we propose a multi-layer distributed deduplication system that uses Cuckoo-filter similarity, taking the characteristics of big data into account. Similarity between virtual machines is computed with a Cuckoo hash so that individual storage nodes improve their deduplication efficiency, and a multi-layer Cuckoo filter is applied to reduce processing time. Experimental results show that the proposed method shortens processing time by 8.9% and increases the deduplication rate by 10.3%.
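
The abstract gives no implementation details; as background, the sketch below is a minimal cuckoo filter (names and parameters are illustrative) showing the property this family of filters offers over Bloom filters: approximate membership tests that also support deletion. It uses partial-key cuckoo hashing, so the number of buckets must be a power of two for the XOR trick to be an involution.

```python
import hashlib
import random

class CuckooFilter:
    """Minimal cuckoo filter with partial-key cuckoo hashing."""
    def __init__(self, num_buckets=1 << 12, bucket_size=4, max_kicks=500):
        self.num_buckets = num_buckets   # must be a power of two
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: bytes) -> int:
        # 16-bit fingerprint; reserve 0 by mapping it to 1.
        return int.from_bytes(hashlib.sha256(item).digest()[:2], "big") or 1

    def _index(self, item: bytes) -> int:
        return int.from_bytes(hashlib.md5(item).digest()[:4], "big") % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # The alternate bucket depends only on the fingerprint, so an
        # entry can be relocated without knowing the original key.
        fp_hash = int.from_bytes(hashlib.md5(fp.to_bytes(2, "big")).digest()[:4], "big")
        return (index ^ fp_hash) % self.num_buckets

    def insert(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets full: evict entries until one fits.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            victim = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False   # filter is considered full

    def __contains__(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)   # Bloom filters cannot do this
                return True
        return False
```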

Improving Efficiency of Encrypted Data Deduplication with SGX (SGX를 활용한 암호화된 데이터 중복제거의 효율성 개선)

  • Koo, Dongyoung
    • KIPS Transactions on Computer and Communication Systems / v.11 no.8 / pp.259-268 / 2022
  • With cloud services widely used to improve management efficiency amid the explosive increase in data volume, various cryptographic techniques are being applied to preserve data privacy. Despite the vast computing resources of cloud systems, the loss of storage efficiency caused by redundant data outsourced from multiple users significantly reduces service efficiency. Among the several approaches to privacy-preserving deduplication over encrypted data, this paper analyzes recent USENIX ATC results on improving the efficiency of encrypted-data deduplication using a trusted execution environment (TEE), in terms of the security and efficiency of the participating entities. We present a way to improve the stability of the key-managing server by integrating it with individual clients, yielding secure deduplication without an independent key server. Experimental results show that the communication efficiency of the proposed approach improves by about 30%, with the effect of a distributed key server, while providing security guarantees as robust as those of the prior work.
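
The TEE-based protocol analyzed in this paper is considerably more involved; purely as background, the sketch below shows the baseline message-locked (convergent) encryption idea that makes deduplication over ciphertexts possible at all, paired with a hypothetical in-memory server index. It requires the third-party `cryptography` package. Note that plain convergent encryption is brute-forceable for predictable plaintexts, which is one of the weaknesses server-aided and TEE-based designs address.

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

def convergent_encrypt(plaintext: bytes):
    """Message-locked encryption: the key is derived from the content,
    so equal plaintexts yield equal ciphertexts and equal dedup tags."""
    key = hashlib.sha256(plaintext).digest()       # content-derived 256-bit key
    nonce = hashlib.sha256(key).digest()[:12]      # deterministic nonce
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    tag = hashlib.sha256(ciphertext).hexdigest()   # dedup tag sent to the server
    return key, tag, ciphertext

class DedupIndex:
    """Hypothetical server-side index: stores each distinct ciphertext
    once and tracks owners, without ever seeing plaintexts or keys."""
    def __init__(self):
        self.store = {}    # tag -> ciphertext
        self.owners = {}   # tag -> set of client ids

    def upload(self, client_id: str, tag: str, ciphertext: bytes) -> bool:
        duplicate = tag in self.store
        if not duplicate:
            self.store[tag] = ciphertext   # only the first copy is stored
        self.owners.setdefault(tag, set()).add(client_id)
        return duplicate                   # True -> storage was saved

index = DedupIndex()
_, tag_a, ct_a = convergent_encrypt(b"same file contents")
index.upload("alice", tag_a, ct_a)
_, tag_b, ct_b = convergent_encrypt(b"same file contents")
assert tag_a == tag_b and index.upload("bob", tag_b, ct_b)   # deduplicated
```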

Performance Analysis of Open Source Based Distributed Deduplication File System (오픈 소스 기반 데이터 분산 중복제거 파일 시스템의 성능 분석)

  • Jung, Sung-Ouk; Choi, Hoon
    • KIISE Transactions on Computing Practices / v.20 no.12 / pp.623-631 / 2014
  • A comparison of two representative deduplication file systems, LessFS and SDFS, shows that LessFS is better in execution time and CPU utilization, while SDFS is better in storage usage (around 1/8 that of general file systems). In this paper, a new system is proposed that combines the advantages of SDFS and LessFS. The new system uses multiple DFEs and one DSE to maintain the integrity and consistency of the data. An evaluation comparing a single-DFE and a dual-DFE configuration indicates that the dual DFE was better: it reduced CPU usage and provided faster deduplication. This suggests that the proposed system can help address the growth of large-scale data storage and its power consumption.
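
The abstract does not define the DFE/DSE split precisely; assuming the terms denote deduplication front ends and a single shared deduplication storage engine, here is a minimal sketch of the shape of such a system, with all names illustrative. The point is that the single DSE owns the authoritative chunk index, so any number of front ends see one consistent view of which chunks already exist.

```python
import hashlib

class DedupStorageEngine:
    """Single back end (DSE): owns the authoritative chunk index, so
    every front end sees one consistent view of stored chunks."""
    def __init__(self):
        self.chunks = {}     # fingerprint -> chunk bytes
        self.refcount = {}   # fingerprint -> number of references

    def put(self, fingerprint: str, data: bytes) -> bool:
        new = fingerprint not in self.chunks
        if new:
            self.chunks[fingerprint] = data
        self.refcount[fingerprint] = self.refcount.get(fingerprint, 0) + 1
        return new

class DedupFrontEnd:
    """Front end (DFE): splits files into fixed-size chunks,
    fingerprints them, and ships them to the shared DSE."""
    def __init__(self, dse: DedupStorageEngine, chunk_size=4096):
        self.dse = dse
        self.chunk_size = chunk_size

    def write(self, data: bytes):
        recipe = []   # ordered fingerprints; enough to rebuild the file
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.dse.put(fp, chunk)
            recipe.append(fp)
        return recipe

# Two front ends share one storage engine, as in the dual-DFE setup.
dse = DedupStorageEngine()
dfe_a, dfe_b = DedupFrontEnd(dse), DedupFrontEnd(dse)
dfe_a.write(b"x" * 8192)
dfe_b.write(b"x" * 8192)           # second write adds no new chunks
assert len(dse.chunks) == 1        # identical chunks stored once
```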

Spark-based Distributed and Parallel DNA Deduplication Method (Spark 기반의 분산 병렬 DNA 중복제거 방법)

  • Moon, Jihye; Lee, Hyeonbyeong; Song, Seokil
    • Proceedings of the Korea Contents Association Conference / 2017.05a / pp.313-314 / 2017
  • This paper proposes a method to accelerate deduplication of DNA reads, one of the stages of DNA analysis, by applying distributed parallel processing. The approach parallelizes an existing deduplication technique on top of Spark. Experiments demonstrate the performance of the proposed technique in comparison with the existing deduplication method.
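
This two-page abstract gives no implementation details. As a rough illustration of how read deduplication parallelizes on Spark, here is a hedged PySpark sketch; the column names (chrom, pos, strand, mapq) and file paths are assumptions, mirroring how common duplicate-marking tools key reads by alignment coordinates rather than the paper's actual scheme.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dna-read-dedup").getOrCreate()

# Hypothetical input: one row per aligned read.
reads = spark.read.parquet("hdfs:///reads.parquet")   # illustrative path

# Reads mapped to identical coordinates are treated as duplicates:
# keep the highest-quality read per (chrom, pos, strand) group.
w = Window.partitionBy("chrom", "pos", "strand").orderBy(F.desc("mapq"))
deduped = (reads
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))

deduped.write.parquet("hdfs:///reads_dedup.parquet")  # illustrative path
```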

Chunk Placement Scheme on Distributed File System Using Deduplication File System (중복제거 파일 시스템을 적용한 분산 파일 시스템에서의 청크 배치 기법)

  • Kim, Keonwoo; Kim, Jeehong; Eom, Young Ik
    • Proceedings of the Korea Information Processing Society Conference / 2013.05a / pp.68-70 / 2013
  • To store and manage massive data effectively, cloud storage systems employ distributed file system technology. However, as data grows, storage expansion costs increase even with a distributed file system. In this paper, to reduce the storage expansion cost of distributed file systems, we propose a chunk placement scheme for a distributed file system combined with a deduplication file system. Applying lessfs, a deduplication file system, to MooseFS, an open-source distributed file system, increases the available storage space and thus reduces storage expansion costs. In addition, placing identical chunks on the same chunk server increases deduplication opportunities. Experiments evaluate the deduplication amount and the performance of the proposed system.
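
The abstract states the placement rule only as "identical chunks go to the same chunk server." One minimal way to realize that rule (illustrative only, not MooseFS's actual chunk-server selection) is to make the target server a pure function of the chunk's content hash, for example via a consistent-hash ring so that adding servers relocates few chunks.

```python
import hashlib
from bisect import bisect_right

class ChunkPlacer:
    """Fingerprint-driven placement: the target chunk server is a pure
    function of the chunk's content hash, so identical chunks from any
    client land on the same server and can be deduplicated there."""
    def __init__(self, servers, vnodes=64):
        # Virtual nodes smooth the load across physical servers.
        self.ring = sorted((self._h(f"{s}#{v}"), s)
                           for s in servers for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(text: str) -> int:
        return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

    def server_for(self, chunk: bytes) -> str:
        fingerprint = hashlib.sha256(chunk).hexdigest()
        idx = bisect_right(self.keys, self._h(fingerprint)) % len(self.ring)
        return self.ring[idx][1]

placer = ChunkPlacer(["chunkserver-1", "chunkserver-2", "chunkserver-3"])
# Identical chunks always map to the same server:
assert placer.server_for(b"same data") == placer.server_for(b"same data")
```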

A Scheme on High-Performance Caching and High-Capacity File Transmission for Cloud Storage Optimization (클라우드 스토리지 최적화를 위한 고속 캐싱 및 대용량 파일 전송 기법)

  • Kim, Tae-Hun; Kim, Jung-Han; Eom, Young-Ik
    • The Journal of Korean Institute of Communications and Information Sciences / v.37 no.8C / pp.670-679 / 2012
  • The recent spread of cloud computing has increased the amount of stored data and, with it, the cost of storing that data. Accordingly, data and service requests from users also increase the load on cloud storage. Many studies have sought low-cost, high-performance schemes for distributed file systems, but most handle parallel and random data access poorly, as well as workloads of frequent small accesses. Recently, improving the performance of distributed file systems through caching has received much attention. In this paper, we propose CHPC (Cloud storage High-Performance Caching), a framework providing parallel caching, distributed caching, and proxy caching in distributed file systems. We compare the proposed framework with existing cloud systems with respect to reducing the server's disk I/O, preventing server-side bottlenecks, deduplicating the page caches of individual clients, and improving overall IOPS. As a result, we show optimization possibilities for cloud storage systems based on evaluations and comparisons with conventional methods.
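
CHPC's actual design is richer than any short sketch; the following Python sketch (all names illustrative) shows only two of the effects the abstract mentions: a proxy cache tier between clients and storage, and page-cache deduplication, where pages are stored once per distinct content hash so clients reading identical blocks share one copy instead of duplicating it.

```python
import hashlib
from collections import OrderedDict

class LocalLRU:
    """Bounded per-client page cache."""
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.pages = OrderedDict()

    def get(self, block_id):
        if block_id in self.pages:
            self.pages.move_to_end(block_id)
            return self.pages[block_id]
        return None

    def put(self, block_id, data):
        self.pages[block_id] = data
        self.pages.move_to_end(block_id)
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict least recently used

class ProxyCache:
    """Shared proxy tier with page-cache deduplication: each distinct
    page body is stored once, keyed by content hash; block ids map
    onto those hashes."""
    def __init__(self):
        self.block_to_hash = {}
        self.content = {}

    def get(self, block_id):
        h = self.block_to_hash.get(block_id)
        return self.content.get(h) if h is not None else None

    def put(self, block_id, data: bytes):
        h = hashlib.sha256(data).hexdigest()
        self.block_to_hash[block_id] = h
        self.content.setdefault(h, data)   # identical pages stored once

class Client:
    """Read path: local cache -> shared proxy -> backing storage.
    Server disk I/O happens only on a miss in both tiers."""
    def __init__(self, proxy, fetch_from_storage):
        self.local = LocalLRU()
        self.proxy = proxy
        self.fetch = fetch_from_storage   # callable: block_id -> bytes

    def read(self, block_id):
        data = self.local.get(block_id)
        if data is None:
            data = self.proxy.get(block_id)
            if data is None:
                data = self.fetch(block_id)
                self.proxy.put(block_id, data)
            self.local.put(block_id, data)
        return data
```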

Study on Automation of Comprehensive IT Asset Management (포괄적 IT 자산관리의 자동화에 관한 연구)

  • Wonseop Hwang; Daihwan Min; Junghwan Kim; Hanjin Lee
    • Journal of Information Technology Services / v.23 no.1 / pp.1-10 / 2024
  • The IT environment is changing due to the acceleration of digital transformation in enterprises and organizations, and this expansion of the digital space makes centralized cybersecurity control more difficult. For this reason, cyberattacks are increasing in frequency and severity and are becoming more sophisticated, as with ransomware and digital supply-chain attacks. Even in large organizations with numerous security personnel and systems, security incidents continue to occur because of unmanaged and unknown IT assets and their threats and vulnerabilities. It is time to move beyond the current focus on detecting and responding to security threats toward managing the full range of cyber risks. This requires an asset inventory for comprehensive management, built by collecting and integrating all IT assets of the enterprise or organization. IT Asset Management (ITAM) systems exist to identify and manage various assets from a financial and administrative perspective, but the asset information they hold is incomplete, contains duplicated data, and is insufficiently updated across network infrastructure, Active Directory, virtualization management, and cloud platforms. In this study, we propose a new framework for automated Comprehensive IT Asset Management (CITAM) to support security operations, designed around a process that automatically collects asset data such as hostname, IP, MAC address, serial number, OS, installed software, and last-seen time, all of which are already distributed across operating IT security systems. The CITAM framework classifies these records into unique device units through an analysis process of aggregation, normalization, deduplication, validation, and integration.
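
The abstract names the CITAM analysis steps but not their mechanics. Here is a minimal Python sketch of how such a pipeline might collapse sightings from multiple systems into unique devices; the field set comes from the abstract, while the merge policy and all names are assumptions, and the validation step is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class AssetRecord:
    # Attribute set mirrors the fields listed in the abstract.
    hostname: str = ""
    ip: str = ""
    mac: str = ""
    serial: str = ""
    os: str = ""
    software: str = ""
    last_seen: str = ""   # ISO-8601 string, so max() picks the latest
    source: str = ""      # which security system reported the sighting

def normalize(rec: AssetRecord) -> AssetRecord:
    """Normalization: canonical case and separators so records from
    different systems become comparable."""
    rec.hostname = rec.hostname.strip().lower()
    rec.mac = rec.mac.strip().replace("-", ":").lower()
    rec.serial = rec.serial.strip().upper()
    return rec

def device_key(rec: AssetRecord) -> str:
    """Deduplication key: prefer stable hardware identifiers (serial,
    MAC) over volatile ones (hostname, IP)."""
    return rec.serial or rec.mac or f"{rec.hostname}|{rec.ip}"

def integrate(records):
    """Aggregation -> normalization -> deduplication -> integration:
    sightings of the same device merge into one unique device record."""
    devices = {}
    for rec in map(normalize, records):
        key = device_key(rec)
        if key not in devices:
            devices[key] = rec
            continue
        merged = devices[key]
        for field in ("hostname", "ip", "os", "software"):
            if not getattr(merged, field):         # fill gaps from new source
                setattr(merged, field, getattr(rec, field))
        merged.last_seen = max(merged.last_seen, rec.last_seen)
    return devices
```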