A Dynamic Checkpoint Scheduling Scheme for Fault Tolerant Distributed Computing Systems

결함 내성 분산 시스템에서의 동적 검사점 스케쥴링 기법

  • Published : 2002.02.01

Abstract

Selecting the optimal checkpointing interval is a critical issue in implementing a checkpointing recovery scheme for fault-tolerant distributed systems. This paper presents a new scheme that allows a process to select a proper checkpointing interval dynamically. Each process in the system evaluates the cost of checkpointing and of possible rollback for each checkpointing interval and selects a proper time interval for the next checkpoint. Unlike other schemes, the cost evaluation considers the overhead incurred by both checkpointing and rollback activities, and the selection of the checkpointing interval reflects the current communication pattern. Moreover, the proposed scheme requires no extra message communication for selecting the checkpointing interval and can easily be incorporated into existing checkpointing coordination schemes.

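For an efficient implementation of checkpoint-based recovery, one of the techniques that provide fault tolerance in distributed systems, the selection of an optimized checkpointing interval is recognized as a very important problem. This paper proposes a scheme in which each process in a distributed system dynamically schedules a proper checkpointing interval during its computation. In the proposed scheme, each process in the system compares and evaluates the checkpointing cost and the possible rollback recovery cost over the current checkpointing interval, and computes a proper interval for the next checkpoint. Unlike most existing schemes, the proposed scheme selects an interval value that minimizes both the checkpointing and the rollback costs and that takes into account the communication pattern during the current checkpointing interval. In addition, no separate communication cost is required for selecting the checkpointing interval, and the proposed scheme can easily be integrated with existing checkpoint coordination schemes.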
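
The interval selection described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's cost model, so everything below is an assumption made for illustration: the function names, the parameters `failure_rate` and `per_message_weight`, and the cost formula itself (one checkpoint per interval, plus an expected rollback loss of half an interval that grows with the number of messages observed locally). Only locally observed quantities are used, in keeping with the claim that no extra messages are needed.

```python
def expected_overhead_rate(interval, checkpoint_cost, failure_rate, dependency_factor):
    """Expected overhead per unit time for one candidate interval.

    Assumed model (not taken from the paper):
      - every interval pays checkpoint_cost once;
      - failures arrive at failure_rate and lose, on average, half an
        interval of work, scaled by dependency_factor (>= 1) as a stand-in
        for rollback propagated through message dependencies.
    """
    checkpoint_term = checkpoint_cost / interval
    rollback_term = failure_rate * dependency_factor * interval / 2.0
    return checkpoint_term + rollback_term


def select_next_interval(candidates, checkpoint_cost, failure_rate,
                         messages_in_interval, per_message_weight=0.1):
    """Pick the candidate interval with the lowest expected overhead rate.

    messages_in_interval is the message count the process saw during the
    current interval; per_message_weight is a hypothetical tuning knob that
    turns that count into a rollback penalty.
    """
    dependency_factor = 1.0 + per_message_weight * messages_in_interval
    return min(
        candidates,
        key=lambda t: expected_overhead_rate(
            t, checkpoint_cost, failure_rate, dependency_factor
        ),
    )


if __name__ == "__main__":
    candidates = [10, 20, 50, 100, 200]
    # Quiet interval: no messages observed.
    print(select_next_interval(candidates, 2.0, 0.001, 0))
    # Busy interval: 40 messages observed, so rollback is assumed costlier.
    print(select_next_interval(candidates, 2.0, 0.001, 40))
```

Under these assumed numbers the quiet case selects a longer interval (50) than the busy case (20): heavier communication raises the assumed rollback cost, so checkpointing more often pays off. A real implementation would replace the formula with the cost evaluation defined in the paper.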
