• Title/Summary/Keyword: checkpointing/rollback

A Striped Checkpointing Scheme for the Cluster System with the Distributed RAID (분산 RAID 기반의 클러스터 시스템을 위한 분할된 결함허용정보 저장 기법)

  • Chang, Yun-Seok
    • The KIPS Transactions:PartA / v.10A no.2 / pp.123-130 / 2003
  • This paper presents a new striped checkpointing scheme for serverless cluster computers, in which the local disks attached to the cluster nodes collectively form a distributed RAID with a single I/O space. Striping enables parallel I/O on the distributed disks, and staggering avoids network bottlenecks in the distributed RAID. We demonstrate how to reduce checkpointing overhead and increase availability by striping and staggering dynamically for communication-intensive applications. The Linpack HPC Benchmark and MPI programs are run under these checkpointing schemes for performance evaluation on a 16-node cluster system. The benchmark results demonstrate the benefits of the striped checkpointing scheme compared to existing schemes, and they are useful for designing efficient checkpointing schemes for fast rollback recovery from any single-node failure in a cluster system.

A Study for Checkpointing Schemes based on a TMR System (TMR 시스템 기반의 Checkpointing 기법에 관한 연구)

  • Kim, Tae-Wook;Kang, Myung-Seok;Kim, Hag-Bae
    • Proceedings of the Korea Information Processing Society Conference / 2003.11a / pp.397-400 / 2003
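  • TMR (Triple Modular Redundancy) is one of the representative fault-tolerance techniques, with the simplest structure among those that statically exploit spatial redundancy (H/W and S/W). When a TMR structure fails, recovering the TMR system by re-executing the part of the program that holds erroneous results, or by restarting the entire program, generally takes considerable time. To overcome this drawback, this paper proposes a technique that combines time and space redundancy: checkpoints are applied to rollback and roll-forward, which are another form of time redundancy, in order to recover from TMR failures effectively.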

An Application-Level Fault Tolerant System For Synchronous Parallel Computation (동기 병렬연산을 위한 응용수준의 결함 내성 연산시스템)

  • Park, Pil-Seong
    • Journal of Internet Computing and Services / v.9 no.5 / pp.185-193 / 2008
  • The MTBF (mean time between failures) of large-scale parallel systems is known to be only on the order of several hours, so large computations sometimes waste a huge amount of CPU time. However, MPI (Message Passing Interface), the de facto standard for message-passing parallel programming, offers no way to handle such a problem. In this paper, we propose an application-level fault-tolerant computation system, built purely on the current MPI standard without any non-standard fault-tolerant MPI library, that can be used for general scientific synchronous parallel computation.

Optimizing Checkpoint Intervals for Real-Time Multi-Tasks with Arbitrary Periods (임의 주기를 가지는 실시간 멀티 태스크를 위한 체크포인트 구간 최적화)

  • Kwak, Seong-Woo;Yang, Jung-Min
    • The Transactions of The Korean Institute of Electrical Engineers / v.60 no.1 / pp.193-200 / 2011
  • This paper presents an optimal checkpoint strategy for fault tolerance in real-time systems. In our environment, multiple real-time tasks with arbitrary periods are scheduled by the Rate Monotonic (RM) algorithm, and checkpoints are inserted at a constant interval within each task, while the interval width differs from task to task. We propose a method to determine the optimal checkpoint interval for each task so that the probability of completing all the tasks is maximized. Whenever a fault occurs in a checkpoint interval of a task, the execution time of the task is prolonged by rollback and re-execution of that interval. Our scheme includes a schedulability test that examines whether a task can still be completed with the extended execution time. A numerical experiment is conducted to demonstrate the applicability of the proposed scheme.

Garbage Collection Protocol of Fault Tolerance Information in Multi-agent Environments (멀티에이전트 환경에서 결함 포용 정보의 쓰레기 처리 기법)

  • 이대원;정광식;이화민;신상철;이영준;유헌창;이원규
    • Journal of KIISE:Computer Systems and Theory / v.31 no.3_4 / pp.204-212 / 2004
  • Distributed systems have a higher probability of failure than stand-alone systems, so many fault-tolerance techniques have been developed. As the amount of stored fault tolerance information grows, storage becomes insufficient and system performance degrades. To avoid this degradation, useless fault tolerance information must be deleted. In this paper, we propose a garbage collection algorithm for fault tolerance information. We define and design a garbage collection agent that collects fault tolerance information, an information agent that manages fault-tolerant data, and a facilitator agent that handles communication between agents. The proposed algorithm is carried out by the garbage collection agent. For rollback recovery, we use an independent checkpointing protocol and a sender-based pessimistic message logging protocol. In the proposed algorithm, the garbage collection, information, and facilitator agents are created together with the process, and the information agent constructs domain knowledge from its checkpoints and non-deterministic events. The garbage collection agent decides when to collect garbage and deletes useless fault tolerance information in cooperation with the information and facilitator agents. To validate the proposed agent-based garbage collection technique, we compare the domain knowledge of a system that performs garbage collection after rollback recovery with that of a system that does not.

An Implementation of Fault Tolerant Software Distributed Shared Memory with Remote Logging (원격 로깅 기법을 이용하는 고장 허용 소프트웨어 분산공유메모리 시스템의 구현)

  • 박소연;김영재;맹승렬
    • Journal of KIISE:Computer Systems and Theory / v.31 no.5_6 / pp.328-334 / 2004
  • Software DSMs continue to improve in performance and scalability. As software DSMs become attractive on larger clusters, attention is likely to shift toward improving system reliability. A popular approach to tolerating failures is message logging with checkpointing, and many log-based rollback recovery schemes have been proposed. In this work, we propose a remote logging scheme that uses the volatile memory of a remote node assigned to each node. Because remote logging does not incur frequent disk accesses during failure-free execution, its logging overhead is not significant, especially over a high-speed communication network. Remote logging tolerates multiple failures as long as the backup nodes of the failed nodes are alive, which makes DSMs considerably more reliable. We have designed and implemented FT-KDSM (Fault Tolerant KAIST DSM) with remote logging and report its logging overhead and recovery time.