• Title/Summary/Keyword: Rollback Recovery

Design and Implementation of Reliable Distributed Programming Environment based on HORB (HORB에 기반한 신뢰성 있는 분산 프로그래밍 환경의 설계 및 구현)

  • Hyun, Mu-Yong; Kim, Shik; Kim, Myung-Jun
    • Journal of the Institute of Electronics Engineers of Korea CI / v.39 no.2 / pp.1-9 / 2002
  • The use of object-oriented distributed programming (OODP) environments such as DCOM, DSOM, Java RMI, and CORBA to implement distributed applications is becoming increasingly popular. Although these middleware platforms greatly enhance the quality and reusability of distributed object-based applications, their lack of a fault-tolerance feature complicates the design and implementation of reliable ones. In this paper, we propose Evergreen, a fault-tolerant programming environment based on RMI that supports reliable distributed computing through checkpoints and a rollback-recovery mechanism. Based on a series of experiments, we evaluate the performance of Evergreen and find that it can be extended to fully support our optimal design goal.

Efficient Process Checkpointing through Fine-Grained COW Management in New Memory based Systems (뉴메모리 기반 시스템에서 세밀한 COW 관리 기법을 통한 효율적 프로세스 체크포인팅 기법)

  • Park, Jay H.; Moon, Young Je; Noh, Sam H.
    • Journal of KIISE / v.44 no.2 / pp.132-138 / 2017
  • We design and implement a process-based fault recovery system to increase the reliability of new-memory-based computer systems. A rollback point, to which a process can roll back upon a fault, is made at every context switch. In this study, a clone of the original process, which we refer to as a P-process (Persistent-process), is created as the rollback point. Such a design minimizes losses when a fault does occur. First, execution loss is minimized because rollback points are created at context switches, which bounds the lost execution. Second, because we make use of the COW (Copy-On-Write) mechanism, only those parts of the process memory state that are modified (in page units) are copied, which decreases the overhead of creating the P-process. Our experimental results show that the overhead is approximately 5% in 8 out of 11 PARSEC benchmark workloads when a P-process is created at every context switch. Even for workloads that incur considerable overhead, we show that this overhead can be reduced by increasing the P-process generation interval.
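
The page-granular copy-on-write idea in the abstract above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' P-process implementation: the class `CowMemory` and all of its method names are hypothetical, memory is modeled as a Python list of immutable `bytes` pages, and the rollback point is triggered by an explicit call rather than by a context switch.

```python
class CowMemory:
    """Toy page-granular memory with copy-on-write rollback points (hypothetical)."""

    def __init__(self, num_pages, page_size=4096):
        self.pages = [bytes(page_size) for _ in range(num_pages)]
        self.saved = {}           # page index -> contents at the last rollback point
        self.pages_preserved = 0  # how many pages had to be preserved so far

    def checkpoint(self):
        """Create a new rollback point; no page is preserved until it is written."""
        self.saved = {}

    def write(self, index, data):
        """Overwrite a page, preserving the old contents on the first write only."""
        if index not in self.saved:
            self.saved[index] = self.pages[index]
            self.pages_preserved += 1
        self.pages[index] = data

    def rollback(self):
        """Restore every page modified since the last rollback point."""
        for index, old in self.saved.items():
            self.pages[index] = old
        self.saved = {}
```

Only pages written after `checkpoint()` are preserved, so the cost of a rollback point in this model grows with the number of modified pages rather than with the total memory size.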

Garbage Collection Protocol of Fault Tolerance Information in Multi-agent Environments (멀티에이전트 환경에서 결함 포용 정보의 쓰레기 처리 기법)

  • 이대원; 정광식; 이화민; 신상철; 이영준; 유헌창; 이원규
    • Journal of KIISE:Computer Systems and Theory / v.31 no.3_4 / pp.204-212 / 2004
  • Distributed systems have a higher probability of failure than stand-alone systems, so many fault-tolerance techniques have been developed. However, as the amount of stored fault-tolerance information grows, storage becomes insufficient and system performance degrades. To avoid this degradation, useless fault-tolerance information must be deleted. In this paper, we propose a garbage collection algorithm for fault-tolerance information. We define and design a garbage collection agent that collects fault-tolerance information, an information agent that manages fault-tolerance data, and a facilitator agent that handles communication between agents, and the proposed algorithm is built on the garbage collection agent. For rollback recovery, we use an independent checkpointing protocol and a sender-based pessimistic message logging protocol. In the proposed algorithm, the garbage collection, information, and facilitator agents are created along with the process, and the information agent constructs domain knowledge from its checkpoints and non-deterministic events. The garbage collection agent decides when to collect garbage and deletes useless fault-tolerance information in cooperation with the information and facilitator agents. To validate the proposed agent-based garbage collection technique, we compare the domain knowledge of a system that performs garbage collection after rollback recovery with that of a system that does not.

Design and Implementation of a Recovery Method for High Dimensional Index Structures (고차원 색인구조를 위한 회복기법의 설계 및 구현)

  • Song, Seok-Il; Lee, Seok-Hui; Yu, Jae-Su
    • The Transactions of the Korea Information Processing Society / v.7 no.7 / pp.2008-2019 / 2000
  • In this paper, we propose a recovery method for high-dimensional index structures. It efficiently recovers transactions that include reinsert operations and that need to be undone or rolled back because of system failures or transaction failures. It is based on the WAL (Write-Ahead Logging) protocol. We apply the method to the FCIR-Tree and implement it on MiDAS-III, the storage system of a multimedia DBMS called BADA-III. We also show through performance evaluation that recovery with our algorithm handles reinsert operations more efficiently than recovery without it.
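
As a rough illustration of the write-ahead logging principle that the method above builds on, the sketch below logs the old value of a key before applying each update, so that a transaction can be undone by replaying its log records in reverse. It is a generic in-memory key-value toy, not the paper's FCIR-Tree recovery algorithm: `WalStore`, `put`, and `rollback` are hypothetical names, the log is not written to durable storage, and there is no locking between transactions.

```python
_MISSING = object()  # marks "key did not exist before this update"


class WalStore:
    """Toy key-value store with an undo log written before each update (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.log = []  # append-only list of (txid, key, old_value)

    def put(self, txid, key, value):
        """Record the old value first (write-ahead), then apply the update."""
        old = self.data.get(key, _MISSING)
        self.log.append((txid, key, old))
        self.data[key] = value

    def rollback(self, txid):
        """Undo the updates of one transaction, newest record first."""
        for rec_txid, key, old in reversed(self.log):
            if rec_txid != txid:
                continue
            if old is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        # Drop the undone records so they are not replayed again.
        self.log = [rec for rec in self.log if rec[0] != txid]
```

Undoing in reverse order matters when a transaction updates the same key more than once: the oldest record is applied last, which restores the value that existed before the transaction began.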

Lazy Garbage Collection of Coordinated Checkpointing Protocol for Avoiding Sympathetic Rollback (동기적 검사점 기법에서 불필요한 복귀를 회피하기 위한 쓰레기 처리 기법)

  • Chung, Kwang-Sik; Yu, Heon-Chang; Lee, Won-Gyu; Lee, Seong-Hoon; Hwang, Chong-Sun
    • Journal of KIISE:Computer Systems and Theory / v.29 no.6 / pp.331-339 / 2002
  • This paper presents a garbage collection protocol for checkpoints and message logs that are saved on stable or volatile storage for fault tolerance. Previous garbage collection schemes for coordinated checkpointing protocols delete all checkpoints except the last checkpoint on each process. However, when implemented on top of a reliable communication protocol such as TCP/IP, a rollback-recovery protocol based only on the last checkpoints causes sympathetic rollback. We show that old checkpoints and message logs, in addition to the last checkpoints, must be preserved in order to replay lost messages. We define the conditions for garbage collection of the checkpoints and message logs kept for lost messages, and present a garbage collection algorithm for checkpoints and message logs in coordinated checkpointing protocols. Because the proposed algorithm uses process information about lost messages that is piggybacked on messages, no additional messages are required for garbage collection. The proposed algorithm produces a 'lazy garbage collection effect' because it relies on the checkpoint information piggybacked on sent and received messages, but this effect does not break the consistency of the whole system.

A Multistriped Checkpointing Scheme for the Fault-tolerant Cluster Computers (다중 분할된 구조를 가지는 클러스터 검사점 저장 기법)

  • Chang, Yun-Seok
    • The KIPS Transactions:PartA / v.13A no.7 s.104 / pp.607-614 / 2006
  • To improve the performance of processes running on a cluster system that writes checkpoints to global stable storage, a checkpointing scheme should reduce process delay by managing the checkpoints of each node to fit the network load. For this reason, a cluster system with a single I/O space on a distributed RAID should choose a suitable checkpointing scheme to obtain the maximum I/O performance and the best rollback-recovery efficiency. In this paper, we improve the striped checkpointing scheme with a dynamic stripe group size that adapts to the variation in network bandwidth at the time of checkpointing. To analyze the performance of the multistriped checkpointing scheme, we ran the Linpack HPC benchmark with MPI on our own cluster system with up to 512 virtual nodes. The benchmark results show that the multistriped checkpointing scheme outperforms the striped checkpointing scheme in checkpoint writing efficiency and rollback recovery under heavy system load.

An Efficient Checkpointing Method for Mobile Hosts via the Software Agent (이동 기기에 적합한 소프트웨어 에이전트 기반의 효율적 체크포인팅 기법)

  • Lim, Sung-Chae
    • The KIPS Transactions:PartA / v.15A no.2 / pp.111-118 / 2008
  • With advances in mobile communication systems, the need for distributed applications running on multiple mobile devices is growing. Because such applications are more exposed to hardware failures of the mobile device and to communication disruptions than traditional applications in fixed networks, it is crucial to develop a recovery mechanism suited to them. Checkpointing is widely used for this purpose, to restart interrupted applications. In this paper, we devise an efficient checkpointing method that uses a software agent executed at the mobile support station. The agent, called the checkpointing agent, supports the concept of rollback distance (R-distance), which bounds the maximum number of local checkpoints that are rolled back. By means of the R-distance, our method can prevent undesirable domino effects and heavy checkpoint overhead while providing high flexibility in checkpoint creation.

A Dynamic Checkpoint Scheduling Scheme for Fault Tolerant Distributed Computing Systems (결함 내성 분산 시스템에서의 동적 검사점 스케쥴링 기법)

  • Park, Tae-Soon
    • Journal of KIISE:Computer Systems and Theory / v.29 no.2 / pp.75-86 / 2002
  • Selecting the optimal checkpointing interval is a critical issue in implementing a checkpointing recovery scheme for fault-tolerant distributed systems. This paper presents a new scheme that allows a process to select a proper checkpointing interval dynamically. A process in the system evaluates the cost of checkpointing and possible rollback for each checkpointing interval and selects the proper time interval for the next checkpoint. Unlike other schemes, the cost evaluation considers the overhead incurred by both checkpointing and rollback activities, and the selection of the checkpointing interval reflects the current communication pattern. Moreover, the proposed scheme requires no extra message communication for checkpointing interval selection and can easily be incorporated into existing checkpointing coordination schemes.

Fault-Tolerant Parallel Applications in Java Message Passing Systems (자바 메시지 전달 시스템에서의 결함 포용 병렬 애플리케이션)

  • 안진호; 김기범; 김정훈; 황종선
    • Proceedings of the Korean Information Science Society Conference / 1998.10a / pp.768-770 / 1998
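  • This paper applies a rollback-recovery technique that provides synchronous checkpointing, causal message logging, and improved asynchronism during recovery, in order to give parallel applications running on a Java message passing system low-cost fault tolerance. Large-scale distributed systems that use heterogeneous computers connected by a communication network provide a very efficient parallel computing environment. However, as these distributed systems grow in scale, their failure rate becomes correspondingly important. Techniques that provide more efficient fault tolerance for large-scale distributed systems with high failure rates are therefore needed. In addition, because large-scale distributed systems are composed of heterogeneous computers, software packages that provide fault tolerance must be platform independent. This problem can be solved by implementing them in the Java language, which is highly portable. Accordingly, this paper implements, in the highly portable Java language, a rollback-recovery technique that provides synchronous checkpointing, causal message logging, and improved asynchronism, so as to provide low-cost fault tolerance to parallel applications running on a Java message passing system.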

Determination of Optimal Checkpoint Interval for RM Scheduled Real-time Tasks (RM 스케줄링된 실시간 태스크에서의 최적 체크 포인터 구간 선정)

  • Kwak, Seong-Woo; Jung, Young-Joo
    • The Transactions of The Korean Institute of Electrical Engineers / v.56 no.6 / pp.1122-1129 / 2007
  • For a system with multiple real-time tasks that have different deadlines, it is very difficult to find the optimal checkpoint interval because the scheduling of tasks must be taken into account. In this paper, we determine the optimal checkpoint interval for multiple real-time tasks scheduled by the RM (Rate Monotonic) algorithm. Faults are assumed to occur according to a Poisson distribution. Checkpoints are inserted at equal intervals within a task, but the interval may differ between tasks. When a fault occurs, the task rolls back to the latest checkpoint and re-executes from that checkpoint. We derive an equation for the maximum slack time of each task and determine the number of checkpoint intervals that can be re-executed for fault recovery. An equation for checking the schedulability of tasks is also derived. Based on these equations, we find the probability that all tasks complete within their deadlines. The checkpoint intervals that maximize this probability are optimal.
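
The reasoning in the abstract above (equally spaced checkpoints, Poisson faults, slack that bounds the number of re-executable intervals) can be sketched for a much simpler setting. The code below is not the paper's derivation: it assumes a single task with no RM interference from other tasks, a fixed per-checkpoint overhead, and that each fault costs at most one re-executed interval. Under those assumptions it computes a lower bound on the probability of meeting the deadline and picks the number of intervals that maximizes it. All function and parameter names are hypothetical.

```python
import math


def success_lower_bound(exec_time, deadline, overhead, fault_rate, n):
    """Lower bound on P(task meets deadline) with n equal checkpoint intervals."""
    segment = exec_time / n + overhead   # one interval plus its checkpoint cost
    fault_free = n * segment             # completion time when no fault occurs
    if fault_free > deadline:
        return 0.0                       # not schedulable even without faults
    # Slack divided by the segment length gives the re-executable intervals.
    k = int((deadline - fault_free) // segment)
    mean = fault_rate * deadline         # expected number of faults before the deadline
    # P(at most k faults) for a Poisson-distributed fault count.
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1))


def best_interval_count(exec_time, deadline, overhead, fault_rate, max_n=50):
    """Number of checkpoint intervals in 1..max_n that maximizes the bound."""
    return max(
        range(1, max_n + 1),
        key=lambda n: success_lower_bound(exec_time, deadline, overhead, fault_rate, n),
    )
```

In this simplified model, more checkpoints shorten each re-executed interval, so more faults can be tolerated within the slack, but every added checkpoint also consumes slack through its overhead. The maximizing count balances the two effects.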