• Title/Summary/Keyword: Checkpointing and Recovery

A Dynamic Checkpoint Scheduling Scheme for Fault Tolerant Distributed Computing Systems (결함 내성 분산 시스템에서의 동적 검사점 스케쥴링 기법)

  • Park, Tae-Soon
    • Journal of KIISE:Computer Systems and Theory / v.29 no.2 / pp.75-86 / 2002
  • Selecting the optimal checkpointing interval has been a critical issue in implementing checkpointing recovery schemes for fault-tolerant distributed systems. This paper presents a new scheme that allows a process to select a proper checkpointing interval dynamically. A process in the system evaluates the cost of checkpointing and of possible rollback for each checkpointing interval and selects the proper time interval for the next checkpoint. Unlike other schemes, the cost evaluation considers the overhead incurred by both checkpointing and rollback activities, and the current communication pattern is reflected in the selection of the checkpointing interval. Moreover, the proposed scheme requires no extra message communication for checkpointing interval selection and can easily be incorporated into existing checkpointing coordination schemes.

Analysis of Checkpointing Model with Instantaneous Error Detection (즉각적 오류 감지가 가능한 경우의 체크포인팅 모형 분석)

  • Lee, Yutae
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.1 / pp.170-175 / 2022
  • Reactive failure management techniques are required to mitigate the impact of errors in high performance computing. Checkpointing is the standard recovery technique for coping with errors. An application employing checkpoints periodically saves its state, so that when an error occurs while some task is executing, the application is rolled back to its last checkpointed task and resumes execution from that task onward. In this paper, assuming that the times to error are mutually independent and generally distributed, we analyze the checkpointing model with instantaneous error detection. The conventional assumption that two or more errors do not take place between two consecutive checkpoints is removed. Given the checkpointing time, down-time, and recovery time, we derive the reliability of the checkpointing model. When the time to error follows an exponential distribution, we obtain the optimal checkpointing interval that achieves the maximum reliability.
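
The abstract does not reproduce the paper's reliability-maximizing derivation. As a rough illustration of what an "optimal interval" under exponentially distributed time to error can look like, the sketch below uses Young's classic first-order approximation, which minimizes expected overhead rather than maximizing reliability and is therefore a different, simpler criterion than the one the paper analyzes. The function names and the overhead model are assumptions made for this example.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost: time to write one checkpoint (assumed constant).
    mtbf: mean time between errors (1 / rate of the exponential distribution).
    Returns sqrt(2 * checkpoint_cost * mtbf), in the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order expected overhead per unit of useful work.

    checkpoint_cost / interval is the cost of writing checkpoints;
    interval / (2 * mtbf) is the expected rework after an error.
    This simple model ignores down-time and recovery time.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    best = young_interval(checkpoint_cost=2.0, mtbf=100.0)
    print(best, overhead_fraction(best, 2.0, 100.0))
```

Under this simplified model, intervals shorter or longer than the returned value both increase the overhead fraction, which is the trade-off between checkpointing cost and rollback cost that these abstracts discuss.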

A Checkpointing and Error Recovery Algorithm Based on 2-Phase Commit Protocol for Distributed Transaction (분산 트랜잭션 처리 시스템에서 2-단계 확인 프로토콜을 근거로 하는 검사점 설정 및 오류 복구 알고리즘)

  • Park, Yun-Yong;Jeon, Seong-Ik;Jo, Ju-Hyeon
    • The Transactions of the Korea Information Processing Society / v.3 no.2 / pp.327-338 / 1996
  • In this paper, we present a new checkpointing algorithm that preserves the consistency of resources in distributed transaction processing systems, together with error recovery algorithms for recovering from failures. Compared with existing algorithms, the proposed checkpointing algorithm minimizes interference with distributed transactions and the storage cost during checkpointing, and needs no extra messages to take a checkpoint. We also show that the error recovery algorithms prevent a distributed transaction with a partial fault from spreading the fault (the so-called domino effect) and from restarting cyclically. Finally, we describe the correctness and performance of the proposed algorithms.

A Multistriped Checkpointing Scheme for the Fault-tolerant Cluster Computers (다중 분할된 구조를 가지는 클러스터 검사점 저장 기법)

  • Chang, Yun-Seok
    • The KIPS Transactions:PartA / v.13A no.7 s.104 / pp.607-614 / 2006
  • On a cluster system that writes checkpoints to global stable storage, a checkpointing scheme should reduce process delay by managing each node's checkpoints to fit the network load, thereby improving the performance of the running processes. For this reason, a cluster system with a single I/O space on a distributed RAID chooses a suitable checkpointing scheme to obtain the maximum I/O performance and the best rollback recovery efficiency. In this paper, we improve the striped checkpointing scheme with a dynamic stripe group size that adapts to the network bandwidth variation at the time of checkpointing. To analyze the performance of the multistriped checkpointing scheme, we ran the Linpack HPC benchmark with MPI on our own cluster system with up to 512 virtual nodes. The benchmark results show that the multistriped checkpointing scheme outperforms the striped checkpointing scheme in checkpoint writing efficiency and rollback recovery under heavy system load.

Taking Point Decision Mechanism of Page-level Incremental Checkpointing based on Cost Analysis of Process Execution Time (프로세스 수행 시간의 비용 분석에 기반을 둔 페이지 단위 점진적 검사점의 작성 시점 결정 기법)

  • Yi Sang-Ho;Heo Jun-Young;Hong Ji-Man
    • The KIPS Transactions:PartA / v.13A no.4 s.101 / pp.289-294 / 2006
  • Checkpointing is an effective mechanism that allows a process interrupted by a system failure to resume execution without restarting from the beginning. In particular, page-level incremental checkpointing saves only the modified pages of a process to minimize the checkpointing overhead. This means that, in incremental checkpointing, the time consumed by checkpointing varies with the number of modified pages. Thus, an efficient checkpointing interval must be determined at run time. In this paper, we present an efficient and adaptive page-level incremental checkpointing facility based on a cost analysis of process execution time. Simulation results show that the proposed mechanism significantly reduces the average process execution time compared with existing fixed-interval page-level incremental checkpointing.
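
The abstract does not give the paper's cost analysis, so the following is only a minimal sketch of the general idea that the decision to checkpoint can depend on the number of modified pages. Every name and parameter here (`CheckpointPolicy`, `failure_rate`, `fixed_cost`, `per_page_cost`) is hypothetical, and the decision rule is an assumed first-order heuristic, not the authors' mechanism: take a checkpoint once the expected rework accumulated since the last checkpoint is at least the cost of saving the currently modified pages.

```python
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    """Hypothetical run-time decision rule for incremental checkpointing."""

    failure_rate: float   # assumed failures per second
    fixed_cost: float     # assumed fixed cost of any checkpoint, in seconds
    per_page_cost: float  # assumed cost of saving one modified page, in seconds

    def checkpoint_cost(self, dirty_pages: int) -> float:
        # Incremental checkpointing only writes the modified pages.
        return self.fixed_cost + self.per_page_cost * dirty_pages

    def expected_rework(self, elapsed: float) -> float:
        # First-order expected lost work since the last checkpoint:
        # integral of failure_rate * t dt from 0 to elapsed.
        return self.failure_rate * elapsed * elapsed / 2.0

    def should_checkpoint(self, elapsed: float, dirty_pages: int) -> bool:
        # Checkpoint once the expected rework reaches the checkpoint cost.
        return self.expected_rework(elapsed) >= self.checkpoint_cost(dirty_pages)


if __name__ == "__main__":
    policy = CheckpointPolicy(failure_rate=0.01, fixed_cost=0.5, per_page_cost=0.001)
    print(policy.should_checkpoint(elapsed=12.0, dirty_pages=100))
```

With this rule, a process that has modified many pages waits longer before checkpointing than one that has modified few, which is the kind of adaptivity a fixed-interval scheme lacks.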

Low-Cost Causal Message Logging based Recovery Algorithm Considering Asynchronous Checkpointing (비동기적 검사점 기록을 고려한 저 비용 인과적 메시지 로깅 기반 회복 알고리즘)

  • Ahn, Jin-Ho;Bang, Seong-Jun
    • The KIPS Transactions:PartA / v.13A no.6 s.103 / pp.525-532 / 2006
  • Compared with previous recovery algorithms for causal message logging, Elnozahy's recovery algorithm considerably reduces the number of stable storage accesses and enables live processes to continue their computations while recovery is in progress. However, if causal message logging is used with asynchronous checkpointing, the state of the system may be inconsistent after this algorithm has executed in the case of concurrent failures. In this paper, we show these inconsistent cases and propose a low-cost recovery algorithm for causal message logging that solves the problem. To ensure system consistency, this algorithm allows the recovery leader to obtain recovery information not only from the live processes but also from the other recovering processes. In addition, the proposed algorithm requires no extra messages compared with Elnozahy's algorithm, and the additional overhead incurred by message piggybacking is very low. Simulation results show that the proposed algorithm increases the recovery information collection time by only about 1.0% to 2.1% compared with Elnozahy's algorithm.

A Striped Checkpointing Scheme for the Cluster System with the Distributed RAID (분산 RAID 기반의 클러스터 시스템을 위한 분할된 결함허용정보 저장 기법)

  • Chang, Yun-Seok
    • The KIPS Transactions:PartA / v.10A no.2 / pp.123-130 / 2003
  • This paper presents a new striped checkpointing scheme for serverless cluster computers, where the local disks attached to the cluster nodes collectively form a distributed RAID with a single I/O space. Striping enables parallel I/O on the distributed disks, and staggering avoids network bottlenecks in the distributed RAID. We demonstrate how to reduce the checkpointing overhead and increase availability by striping and staggering dynamically for communication-intensive applications. The Linpack HPC benchmark and MPI programs are applied to these checkpointing schemes for performance evaluation on a 16-node cluster system. The benchmark results show the benefits of the striped checkpointing scheme compared with existing schemes, and these results are useful for designing an efficient checkpointing scheme for fast rollback recovery from any single node failure in a cluster system.
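
To make the striping idea concrete, here is a minimal in-memory sketch, assuming a simple round-robin layout with fixed-size blocks. It is not the paper's implementation: the function names, the block size, and the use of Python lists to stand in for disks are all assumptions of this example, and staggering and RAID redundancy are not modeled.

```python
from typing import List


def stripe(data: bytes, num_disks: int, block_size: int) -> List[List[bytes]]:
    """Split checkpoint data into blocks and assign them round-robin to disks."""
    if num_disks <= 0 or block_size <= 0:
        raise ValueError("num_disks and block_size must be positive")
    disks: List[List[bytes]] = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


def reassemble(disks: List[List[bytes]]) -> bytes:
    """Rebuild the checkpoint by reading one block per disk, row by row."""
    blocks: List[bytes] = []
    depth = max((len(disk) for disk in disks), default=0)
    for row in range(depth):
        for disk in disks:
            if row < len(disk):
                blocks.append(disk[row])
    return b"".join(blocks)


if __name__ == "__main__":
    image = bytes(range(10))
    layout = stripe(image, num_disks=3, block_size=2)
    print(layout, reassemble(layout) == image)
```

Because consecutive blocks land on different disks, the blocks of one checkpoint could in principle be written or read in parallel, which is the benefit the abstract attributes to striping.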

Fault Recovery and Optimal Checkpointing Strategy for Dual Modular Redundancy Real-time Systems (중복구조 실시간 시스템에서의 고장 극복 및 최적 체크포인팅 기법)

  • Kwak, Seong-Woo
    • Journal of the Institute of Electronics Engineers of Korea TC / v.44 no.7 s.361 / pp.112-121 / 2007
  • In this paper, we propose a new checkpointing strategy for dual modular redundancy real-time systems. At every checkpoint, the execution results from the two processors and the result saved at the previous checkpoint are compared to detect faults. We devise an operation algorithm at checkpoints to recover from transient faults as well as permanent faults. We also develop a Markov model for optimizing the proposed checkpointing strategy. The probability of successful task execution within its deadline is derived from the Markov model. The optimal number of checkpoints is the number that maximizes this probability.

Checkpointing and Rollback-Recovery Protocols in Distributed Computing Systems (분산 계산 환경의 검사점 작성 및 롤백 복구 프로토콜)

  • 안성준;조유근
    • Proceedings of the Korean Information Science Society Conference / 1999.10c / pp.93-95 / 1999
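  • Checkpointing and rollback protocols for message-passing distributed computing environments can be broadly classified into three types: coordinated checkpointing, loosely coordinated checkpointing, and independent checkpointing. The performance of these protocols is affected by application characteristics and the execution environment, such as the frequency and pattern of interprocess communication. Although there has been much research on the performance of each previously proposed protocol, no study has implemented these different types of protocols in the same environment and compared their performance. In this paper, we implement checkpointing and rollback-recovery protocols and present the results of measuring their performance in the same environment. We also analyze the factors that affect the performance of these protocols, and present criteria for evaluating their performance and for selecting a protocol suited to the characteristics of an application.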

Reducing Overhead of Distributed Checkpointing with Group Communication

  • Ahn, Jinho
    • Journal of Advanced Information Technology and Convergence / v.10 no.2 / pp.83-90 / 2020
  • A protocol, HMNR, was proposed to minimize the number of forced checkpoints by using control information about every other process piggybacked on each sent message. An improved protocol, called Lazy-HMNR, was then presented to lower the possibility of taking forced checkpoints caused by the asymmetry between the checkpointing frequencies of processes. Despite these two minimization techniques, under high message interaction traffic the shortcomings of Lazy-HMNR may considerably lower the probability of determining that no Z-cycle occurs. We also observe that no previous work provides procedures that exploit the network infrastructure to greatly decrease the number of forced checkpoints using the dependency information carried on every application message. We introduce a novel Lazy-HMNR protocol for group communication-based distributed computing systems that reduces the number of forced checkpoints more effectively. Our simulation results show that the proposed protocol can greatly reduce the frequency of forced checkpoints compared with Lazy-HMNR.