• Title/Summary/Keyword: checkpointing

A Striped Checkpointing Scheme for the Cluster System with the Distributed RAID (분산 RAID 기반의 클러스터 시스템을 위한 분할된 결함허용정보 저장 기법)

  • Chang, Yun-Seok
    • The KIPS Transactions:PartA / v.10A no.2 / pp.123-130 / 2003
  • This paper presents a new striped checkpointing scheme for serverless cluster computers, where the local disks attached to the cluster nodes collectively form a distributed RAID with a single I/O space. Striping enables parallel I/O on the distributed disks, and staggering avoids a network bottleneck in the distributed RAID. We demonstrate how to reduce the checkpointing overhead and increase availability by striping and staggering dynamically for communication-intensive applications. The Linpack HPC Benchmark and MPI programs are run against these checkpointing schemes for performance evaluation on a 16-node cluster system. Benchmark results show the benefits of the striped checkpointing scheme compared with the existing schemes, and these results are useful for designing an efficient checkpointing scheme for fast rollback recovery from any single-node failure in a cluster system.
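
The two ideas named in the abstract above, striping and staggering, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the round-robin placement, and the fixed-size write groups are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple

Stripe = Tuple[int, bytes]  # (stripe index, stripe payload)


def stripe_checkpoint(data: bytes, num_disks: int, stripe_size: int) -> Dict[int, List[Stripe]]:
    """Split a checkpoint image into fixed-size stripes, placed round-robin on disks.

    Hypothetical helper: each disk could then be written in parallel.
    """
    layout: Dict[int, List[Stripe]] = {disk: [] for disk in range(num_disks)}
    for index, offset in enumerate(range(0, len(data), stripe_size)):
        layout[index % num_disks].append((index, data[offset:offset + stripe_size]))
    return layout


def reassemble(layout: Dict[int, List[Stripe]]) -> bytes:
    """Rebuild the checkpoint image by ordering stripes by their index."""
    stripes = sorted(stripe for disk in layout.values() for stripe in disk)
    return b"".join(payload for _, payload in stripes)


def staggered_schedule(num_nodes: int, group_size: int) -> List[List[int]]:
    """Assign nodes to successive time slots so that only `group_size` nodes
    write at once (an assumed, static form of staggering)."""
    return [
        list(range(start, min(start + group_size, num_nodes)))
        for start in range(0, num_nodes, group_size)
    ]


if __name__ == "__main__":
    layout = stripe_checkpoint(b"abcdefghij", num_disks=3, stripe_size=4)
    print(layout)
    print(reassemble(layout))
    print(staggered_schedule(num_nodes=16, group_size=4))
```

Smaller write groups lower contention on the network at the cost of a longer total checkpoint window; the paper adjusts striping and staggering dynamically, which this static sketch does not attempt.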

Fault-Tolerance Improvement of Real-Time Embedded System using Static Checkpointing (실시간 임베디드 시스템의 결함 허용성 개선을 위한 정적 체크포인팅 방안)

  • Ryu, Sang-Moon
    • Journal of Institute of Control, Robotics and Systems / v.13 no.12 / pp.1147-1152 / 2007
  • This paper deals with a scheme for improving the fault tolerance of real-time embedded systems, which uses an equidistant checkpointing technique to tolerate transient errors. Transient errors are caused by transient faults, which are the most significant type of fault in reliable computer systems. Transient faults are assumed to occur according to a Poisson process and to be detected in a non-concurrent manner (e.g., checked periodically). The probability of successful real-time task completion in the presence of transient errors is derived, taking into account the possible effects of the transient errors. Based on this, a condition under which inserting checkpoints improves the fault tolerance of the system is introduced, and an optimal equidistant checkpointing strategy that achieves the highest fault tolerance is presented.
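
The trade-off described in the abstract above can be made concrete with a deliberately simplified model. This is a sketch and not the paper's derivation. It assumes that a task of length `work` is cut into `n` equal segments, that each segment ends with a checkpoint costing `overhead`, that faults arrive as a Poisson process with rate `rate`, that a fault is detected only at the end of the segment it hit, and that a faulty segment is re-executed from the previous checkpoint. All names and parameters are hypothetical.

```python
import math


def completion_probability(work: float, deadline: float, overhead: float,
                           rate: float, n: int) -> float:
    """Probability that all n segments succeed before the deadline.

    Simplified model (an assumption, not the paper's formula): each segment
    attempt takes work/n + overhead and is fault-free with probability
    exp(-rate * segment_time). The task meets its deadline if at least n of
    the attempts that fit before the deadline are fault-free.
    """
    segment_time = work / n + overhead
    attempts = int(deadline // segment_time)  # attempts that fit in the deadline
    if attempts < n:
        return 0.0
    p = math.exp(-rate * segment_time)        # one attempt is fault-free
    # P(Binomial(attempts, p) >= n)
    return sum(
        math.comb(attempts, k) * p**k * (1.0 - p) ** (attempts - k)
        for k in range(n, attempts + 1)
    )


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        prob = completion_probability(work=10.0, deadline=15.0,
                                      overhead=0.5, rate=0.05, n=n)
        print(n, round(prob, 4))
```

Under these assumptions, adding checkpoints helps only when the slack before the deadline leaves room for at least one re-execution of a segment; too many checkpoints consume that slack as overhead.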

Page-level Incremental Checkpointing for Efficient Use of Stable Storage (안정 저장장치의 효율적 사용을 위한 페이지 기반 점진적 검사점 기법)

  • Heo, Jun-Young;Yi, Sang-Ho;Gu, Bon-Cheol;Cho, Yoo-Kun;Hong, Ji-Man
    • Journal of KIISE:Computer Systems and Theory / v.34 no.12 / pp.610-617 / 2007
  • Incremental checkpointing, which is intended to minimize checkpointing overhead, saves only the modified pages of a process. However, the cumulative size of incremental checkpoints increases at a steady rate over time because a number of updated values may be saved for the same page. In this paper, we present a comprehensive overview of Pickpt, a page-level incremental checkpointing facility. Pickpt provides space-efficient techniques aimed at minimizing the use of disk space. Our experimental results showed that disk space usage with Pickpt was significantly reduced compared with existing incremental checkpointing.
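
The problem stated in the abstract above, that several saved versions of the same page accumulate over time, can be shown with a short Python sketch. It is not Pickpt: the class name, the hash-based change detection, and the `compact` step are assumptions used only to illustrate page-level incremental checkpointing over a fixed-size memory image.

```python
import hashlib
from typing import Dict, List


class IncrementalCheckpointer:
    """Toy page-level incremental checkpointer (hypothetical, not Pickpt)."""

    def __init__(self, page_size: int = 4096) -> None:
        self.page_size = page_size
        self._digests: Dict[int, bytes] = {}          # last seen hash per page
        self.checkpoints: List[Dict[int, bytes]] = []  # one delta per checkpoint

    def checkpoint(self, memory: bytes) -> Dict[int, bytes]:
        """Save only the pages whose content changed since the last checkpoint."""
        delta: Dict[int, bytes] = {}
        for number, offset in enumerate(range(0, len(memory), self.page_size)):
            page = memory[offset:offset + self.page_size]
            digest = hashlib.sha256(page).digest()
            if self._digests.get(number) != digest:
                delta[number] = page
                self._digests[number] = digest
        self.checkpoints.append(delta)
        return delta

    def stored_bytes(self) -> int:
        """Total bytes held across all saved deltas."""
        return sum(len(page) for delta in self.checkpoints for page in delta.values())

    def compact(self) -> None:
        """Keep only the newest version of each page (assumed space-saving step)."""
        merged: Dict[int, bytes] = {}
        for delta in self.checkpoints:
            merged.update(delta)
        self.checkpoints = [merged]

    def restore(self) -> bytes:
        """Rebuild the latest memory image by replaying deltas in order."""
        merged: Dict[int, bytes] = {}
        for delta in self.checkpoints:
            merged.update(delta)
        return b"".join(merged[number] for number in sorted(merged))


if __name__ == "__main__":
    ckpt = IncrementalCheckpointer(page_size=4)
    ckpt.checkpoint(b"aaaabbbbcccc")
    ckpt.checkpoint(b"aaaaXXXXcccc")
    ckpt.checkpoint(b"aaaaYYYYcccc")
    print(ckpt.stored_bytes())  # page 1 is stored three times
    ckpt.compact()
    print(ckpt.stored_bytes(), ckpt.restore())
```

Replaying the deltas in order always yields the latest image, so the obsolete versions of a page are pure overhead unless older checkpoints must remain restorable.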

A Checkpointing and Error Recovery Algorithm Based on 2-Phase Commit Protocol for Distributed Transaction (분산 트랜잭션 처리 시스템에서 2-단계 확인 프로토콜을 근거로 하는 검사점 설정 및 오류 복구 알고리즘)

  • Park, Yun-Yong;Jeon, Seong-Ik;Jo, Ju-Hyeon
    • The Transactions of the Korea Information Processing Society / v.3 no.2 / pp.327-338 / 1996
  • In this paper, we present a new checkpointing algorithm to preserve the consistency of resources in distributed transaction processing systems, and error recovery algorithms to recover from failures. Compared with existing algorithms, the proposed checkpointing algorithm minimizes interference with distributed transactions and the storage cost during checkpointing, and does not need extra messages to take a checkpoint. We also show that the error recovery algorithms prevent a distributed transaction with a partial fault from spreading the fault, which is called the domino effect, and prevent transactions from restarting cyclically. Finally, we describe the correctness and the performance of the proposed algorithms.

Recoverable Distributed shared Memory Systems Using Object-Oriented Dependency Tracking and Checkpointing (객체지향 종속 추적 및 체크포인팅(checkpointing)을 이용한 복구 가능한 분산 공유 메모리 시스템)

  • Kim, Jae-Hun
    • The Transactions of the Korea Information Processing Society / v.6 no.2 / pp.476-484 / 1999
  • Many message logging and checkpointing schemes have been proposed for fault tolerance in distributed systems in which nodes communicate by message passing. Most research on recoverable distributed shared memory (DSM) also adopts schemes similar to those used in message passing systems. However, schemes used in message passing systems are not always appropriate for direct use in DSM systems, because message passing systems and DSM systems differ in nature (function shipping versus data shipping). Many modified schemes have been proposed for DSM systems to resolve these differences. In this paper, an object-oriented approach is proposed for recoverable DSM. We present a new dependency tracking scheme that tracks dependencies between pages instead of between processes. Based on this scheme, we propose new checkpointing and recovery schemes that reduce the overhead of making DSM recoverable.

Design and Implementation of a User-based MPI Checkpointer for Portability (이식성을 고려한 사용자기반 MPI 체크포인터의 설계 및 구현)

  • Ahn Sun-Il;Han Sang-Yong
    • Journal of KIISE:Computer Systems and Theory / v.33 no.1_2 / pp.35-43 / 2006
  • An MPI checkpointer is a tool that provides fault tolerance through checkpointing. Previous research on MPI checkpointers has focused on automatic checkpointing and recovery capabilities, but has not considered portability issues. In this paper, we discuss the design and implementation issues considered for portability when we developed an MPI checkpointer called STFT. To increase portability, STFT first supports an abstraction interface for a single-process checkpointer. Second, STFT uses a user-based checkpointing method and limits the places where a user can take a checkpoint. Third, STFT lets MPI_Init create network connections to the other MPI processes in a fixed order. With these features, we expect that STFT can be easily adapted to various platforms and MPI implementations, and we confirmed with a prototype implementation that STFT is easily adaptable to LAM and MPICH/P4.

New Z-Cycle Detection Algorithm Using Communication Pattern Transformation for the Minimum Number of Forced Checkpoints (통신 유형 변형을 이용하여 검사점 생성 개수를 개선한 검사점 Z-Cycle 검출 기법)

  • Woo Namyoon;Yeom Heon Young;Park Taesoon
    • Journal of KIISE:Computer Systems and Theory / v.31 no.12 / pp.692-703 / 2004
  • Communication-induced checkpointing (CIC) is one of the checkpointing techniques that provide fault tolerance for distributed systems. Independent checkpoints that each distributed process takes without coordination are likely to be useless. Useless checkpoints, which cannot belong to any consistent global checkpoint set, induce nondeterminant rollback. To prevent useless checkpoints, CIC forces processes to take additional checkpoints at the proper moments. The number of those forced checkpoints is the main source of failure-free overhead in CIC. In this paper, we present two new CIC protocols that satisfy the 'No Z-Cycle (NZC)' property. The proposed protocols reduce the number of forced checkpoints compared with the existing protocols, at the cost of increased message delay. Our simulation results with synthetic data show that the proposed protocols have lower failure-free overhead than the existing protocols. Additionally, we show that the classical 'index-based checkpointing' protocols are inefficient in constructing a consistent global cut in distributed executions.

A Dynamic Reconfiguration Method using Application-level Checkpointing in a Grid Computing Environment with Cactus and Globus (Cactus와 Globus에 기반한 그리드 컴퓨팅 환경에서의 응용프로그램 수준의 체크포인팅을 사용한 동적 재구성 기법)

  • Kim Young Gyun;Oh Gil-ho;Cho Kum Won;Na Jeoung-Su
    • Journal of KIISE:Computing Practices and Letters / v.11 no.6 / pp.465-476 / 2005
  • In this paper, we propose a new dynamic reconfiguration method using application-level checkpointing in a grid computing environment with Cactus and Globus. Existing dynamic reconfiguration methods have depended on specific hardware and operating systems. The proposed method performs dynamic reconfiguration without requiring specific hardware or operating system support, and an application can be programmed without considering dynamic reconfiguration. In the proposed method, the job starts with an initial configuration of computing resources and restarts with new resources found dynamically at run time. The method decides whether to include newly found idle sites by considering their processor performance and available memory. It writes the intermediate results of the job to disk using system-independent application-level checkpointing, which also allows real-time visualization while the job runs. After reconfiguring with the newly found idle sites and idle processors, the job resumes from the checkpoint files. The proposed dynamic reconfiguration method is shown to be valid by a decrease in total execution time in K*Grid.

Checkpointing and Rollback-Recovery Protocols in Distributed Computing Systems (분산 계산 환경의 검사점 작성 및 롤백 복구 프로토콜)

  • 안성준;조유근
    • Proceedings of the Korean Information Science Society Conference / 1999.10c / pp.93-95 / 1999
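  • Checkpointing and rollback protocols for message-passing distributed computing environments fall broadly into three categories: coordinated checkpointing, loosely coordinated checkpointing, and independent checkpointing. The performance of these protocols is affected by the characteristics of the application, such as the frequency and pattern of interprocess communication, and by the execution environment. Although there has been much research on the performance of each previously proposed protocol, no study has implemented the different kinds of protocols in the same environment and compared their performance. In this paper, we implement checkpointing and rollback-recovery protocols and present the results of measuring their performance in the same environment. We also analyze the factors that affect the performance of checkpointing and rollback-recovery protocols, and present criteria for evaluating these protocols and for selecting a protocol suited to the characteristics of an application.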

Reducing Overhead of Distributed Checkpointing with Group Communication

  • Ahn, Jinho
    • Journal of Advanced Information Technology and Convergence / v.10 no.2 / pp.83-90 / 2020
  • A protocol, HMNR, was proposed to use control information about every other process, piggybacked on each sent message, to minimize the number of forced checkpoints. An improved protocol, called Lazy-HMNR, was then presented to lower the possibility of taking forced checkpoints caused by the asymmetry between the checkpointing frequencies of processes. Despite these two different minimization techniques, when message traffic is high, Lazy-HMNR may considerably lower the probability of knowing that no Z-cycle occurs, due to its shortcomings. We also recognize that no previous work has procedures that can use network infrastructure to greatly decrease the number of forced checkpoints with the dependency information carried on every application message. We introduce a novel Lazy-HMNR protocol for group communication-based distributed computing systems to cut the number of forced checkpoints more effectively. Our simulation results showed that the proposed protocol may greatly reduce the frequency of forced checkpoints in comparison with Lazy-HMNR.