http://dx.doi.org/10.6109/jkiice.2022.26.1.170

Analysis of Checkpointing Model with Instantaneous Error Detection  

Lee, Yutae (Department of Information and Communications Engineering, Dong-eui University)
Abstract
Reactive failure management techniques are required to mitigate the impact of errors in high-performance computing. Checkpointing is the standard recovery technique for coping with errors. An application that employs checkpointing periodically saves its state, so that when an error occurs while a task is executing, the application is rolled back to its last checkpointed task and resumes execution from that task onward. In this paper, assuming that the times to error are mutually independent and generally distributed, we analyze a checkpointing model with instantaneous error detection. We remove the conventional assumption that at most one error occurs between two consecutive checkpoints. Given the checkpointing time, downtime, and recovery time, we derive the reliability of the checkpointing model. When the time to error follows an exponential distribution, we obtain the optimal checkpointing interval that maximizes reliability.
Keywords
Checkpointing; Instantaneous error detection; Reliability; Mathematical analysis
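
The rollback mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's reliability analysis: it estimates the mean time to complete one work segment plus its checkpoint when errors arrive with exponentially distributed times and are detected instantly, and it compares that estimate with the classical closed-form expected time for the exponential case. The function and parameter names are hypothetical, and the sketch assumes that errors can strike during work, checkpointing, and recovery but not during downtime.

```python
import math
import random


def expected_segment_time(work, ckpt, down, recov, rate):
    """Closed-form mean time to finish one segment (work + checkpoint)
    under exponential errors. This is the classical expected-time result,
    not the reliability measure derived in the paper."""
    return (1.0 / rate + down) * math.exp(rate * recov) * math.expm1(rate * (work + ckpt))


def simulate_segment(work, ckpt, down, recov, rate, rng):
    """Simulate one segment; any number of errors may occur before it completes."""
    elapsed = 0.0
    attempt = work + ckpt  # the first attempt needs no recovery
    while True:
        time_to_error = rng.expovariate(rate)
        if time_to_error >= attempt:
            return elapsed + attempt  # work and checkpoint both completed
        # Error detected instantly: the partial attempt is lost, downtime is
        # paid, and the next attempt must recover before redoing the work.
        elapsed += time_to_error + down
        attempt = recov + work + ckpt


def mean_segment_time(work, ckpt, down, recov, rate, runs, rng):
    """Average the simulated completion time over independent runs."""
    total = sum(simulate_segment(work, ckpt, down, recov, rate, rng) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    rng = random.Random(0)
    params = dict(work=10.0, ckpt=1.0, down=0.5, recov=1.0, rate=0.05)
    print("simulated:  ", mean_segment_time(runs=50_000, rng=rng, **params))
    print("closed form:", expected_segment_time(**params))
```

Because the exponential distribution is memoryless, a fresh time to error can be drawn at the start of every attempt. With a general time-to-error distribution, as considered in the paper, this shortcut would no longer hold.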