[KSCI] Korea Science Citation Index Service

A Dynamic Checkpoint Scheduling Scheme for Fault Tolerant Distributed Computing Systems

Park, Tae-Soon (Department of Computer Engineering, Sejong University)

Publication Information

Journal of KIISE: Computer Systems and Theory / v.29, no.2, 2002, pp. 75-86

Abstract

Selecting the optimal checkpointing interval is a critical issue in implementing a checkpointing recovery scheme for a fault-tolerant distributed system. This paper presents a new scheme that allows a process to select a proper checkpointing interval dynamically. A process in the system evaluates the cost of checkpointing and of a possible rollback for each checkpointing interval, and selects the proper time interval for the next checkpoint. Unlike other schemes, the overhead incurred by both the checkpointing and the rollback activities is considered in the cost evaluation, and the current communication pattern is reflected in the selection of the checkpointing interval. Moreover, the proposed scheme requires no extra message communication for the checkpointing interval selection and can easily be incorporated into existing checkpointing coordination schemes.
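
The abstract does not give the cost function itself, so the following is only a minimal sketch of the general idea: score each candidate interval by checkpointing overhead plus expected rollback overhead, then pick the cheapest. The cost model and every name in it (expected_cost_rate, select_interval, failure_rate, msg_rate, rollback_cost_per_msg) are assumptions made for illustration and are not taken from the paper.

```python
def expected_cost_rate(interval, checkpoint_cost, failure_rate,
                       msg_rate, rollback_cost_per_msg):
    """Estimated overhead per unit time for one candidate interval.

    Hypothetical model, not the paper's: checkpoint cost is amortized over
    the interval, and a failure loses about half an interval of work plus
    extra rollback work proportional to the messages exchanged meanwhile.
    """
    # Cost of taking one checkpoint, spread over the interval length.
    checkpoint_term = checkpoint_cost / interval
    # On average a failure strikes halfway through an interval.
    lost_work = interval / 2.0
    # Messages sent since the last checkpoint may drag other processes back.
    propagation = msg_rate * lost_work * rollback_cost_per_msg
    rollback_term = failure_rate * (lost_work + propagation)
    return checkpoint_term + rollback_term


def select_interval(candidates, checkpoint_cost, failure_rate,
                    msg_rate, rollback_cost_per_msg):
    """Return the candidate interval with the lowest estimated cost rate."""
    return min(
        candidates,
        key=lambda t: expected_cost_rate(
            t, checkpoint_cost, failure_rate, msg_rate, rollback_cost_per_msg
        ),
    )


candidates = [5, 10, 20, 40, 80]
# Quiet process: no recent messages, so only local lost work counts.
quiet = select_interval(candidates, 2.0, 0.01, 0.0, 1.0)
# Chatty process: frequent messages raise the expected rollback cost.
chatty = select_interval(candidates, 2.0, 0.01, 3.0, 1.0)
print(quiet, chatty)
```

Under these assumed parameters the quiet process selects an interval of 20 and the chatty one selects 10, which illustrates how a measured communication pattern could shorten the next interval. Because msg_rate would be observed locally in such a sketch, the selection needs no extra messages, in line with the property the abstract claims.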

Keywords

Fault-Tolerant System; Distributed System; Checkpointing; Rollback-Recovery; Dynamic Scheduling

