
A Dynamic Checkpoint Scheduling Scheme for Fault Tolerant Distributed Computing Systems  

Park, Tae-Soon (Department of Computer Engineering, Sejong University)
Abstract
Selecting the optimal checkpointing interval is a critical issue in implementing a checkpointing recovery scheme for fault-tolerant distributed systems. This paper presents a new scheme that allows a process to select a proper checkpointing interval dynamically. A process in the system evaluates the cost of checkpointing and of possible rollback for each candidate checkpointing interval, and selects the proper time interval for the next checkpoint. Unlike other schemes, the cost evaluation considers the overhead incurred by both checkpointing and rollback activities, and the current communication pattern is reflected in the selection of the checkpointing interval. Moreover, the proposed scheme requires no extra message communication for checkpointing interval selection and can easily be incorporated into existing checkpointing coordination schemes.
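The abstract does not give the paper's cost function, so the following is only a minimal sketch of the general idea of scoring each candidate interval by its combined checkpointing and rollback overhead and picking the cheapest. It assumes a simple first-order model that is not taken from the paper: a fixed checkpoint cost, a constant failure rate, an expected loss of half an interval of work per failure, and a fixed restart cost. The function and parameter names (`expected_overhead_rate`, `select_interval`, `checkpoint_cost`, `failure_rate`, `restart_cost`) are hypothetical.

```python
def expected_overhead_rate(interval, checkpoint_cost, failure_rate, restart_cost):
    """Expected overhead per unit time for one candidate interval.

    Assumed model (not from the paper): one checkpoint is taken per
    interval, a failure occurs in the interval with probability of about
    failure_rate * interval, and a failure loses half an interval of work
    on average plus a fixed restart cost.
    """
    expected_rollback = failure_rate * interval * (interval / 2.0 + restart_cost)
    return (checkpoint_cost + expected_rollback) / interval


def select_interval(candidates, checkpoint_cost, failure_rate, restart_cost):
    """Return the candidate interval with the lowest expected overhead rate."""
    return min(
        candidates,
        key=lambda t: expected_overhead_rate(
            t, checkpoint_cost, failure_rate, restart_cost
        ),
    )


if __name__ == "__main__":
    # Illustrative values only; a dynamic scheme would re-estimate these
    # inputs from local observations before each selection.
    candidates = [5, 10, 20, 40, 80]
    best = select_interval(candidates, checkpoint_cost=2.0,
                           failure_rate=0.01, restart_cost=1.0)
    print("selected interval:", best)
```

In a dynamic scheme of the kind the abstract describes, a process would presumably call a selection routine like this at each checkpoint, with rollback-cost estimates refreshed from the communication it has recently observed, so that the choice needs no extra messages.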
Keywords
Fault-Tolerant System; Distributed System; Checkpointing; Rollback-Recovery; Dynamic Scheduling