[KSCI] Korea Science Citation Index Service

New Z-Cycle Detection Algorithm Using Communication Pattern Transformation for the Minimum Number of Forced Checkpoints

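Woo Namyoon (School of Computer Science and Engineering, Seoul National University)
Yeom Heon Young (School of Computer Science and Engineering, Seoul National University)
Park Taesoon (School of Computer Engineering, Sejong University)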

Publication Information

Journal of KIISE: Computer Systems and Theory / v.31, no.12, 2004, pp. 692-703

Abstract

Communication-induced checkpointing (CIC) is a checkpointing technique that provides fault tolerance for distributed systems. Independent checkpoints, which each process takes without coordination, are likely to be useless. Useless checkpoints, which cannot belong to any consistent global checkpoint, lead to nondeterministic rollback. To prevent useless checkpoints, CIC forces processes to take additional checkpoints at appropriate moments. The number of these forced checkpoints is the main source of failure-free overhead in CIC. In this paper, we present two new CIC protocols that satisfy the 'No Z-Cycle (NZC)' property. The proposed protocols take fewer forced checkpoints than existing protocols, at the cost of increased message delay. Our simulation results with synthetic data show that the proposed protocols have lower failure-free overhead than the existing protocols. Additionally, we show that the classical 'index-based checkpointing' protocols are inefficient at constructing consistent global cuts in distributed executions.
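
For orientation only, the forced-checkpoint idea behind the classical index-based protocols mentioned in the abstract can be sketched as follows. This is not one of the two protocols proposed in the paper, whose details the abstract does not give. It is a minimal illustration of the well-known rule in which each message carries the sender's checkpoint index and a receiver that is behind takes a forced checkpoint before delivery. The class name `Process`, its fields, and the method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process applying an index-based forced-checkpoint rule (illustrative)."""
    name: str
    index: int = 0                            # index of the latest local checkpoint
    log: list = field(default_factory=list)   # (index, kind) for each checkpoint taken

    def basic_checkpoint(self):
        # A checkpoint the process takes on its own initiative.
        self.index += 1
        self.log.append((self.index, "basic"))

    def send(self):
        # Piggyback the current checkpoint index on every outgoing message.
        return self.index

    def receive(self, piggybacked_index):
        # Forced checkpoint: taken BEFORE delivery when the sender's index is
        # larger, so the receiver's index never falls behind a message it has
        # delivered. Index-based protocols rely on this to avoid Z-cycles.
        if piggybacked_index > self.index:
            self.index = piggybacked_index
            self.log.append((self.index, "forced"))
        # ... deliver the message to the application here ...


p, q = Process("P"), Process("Q")
p.basic_checkpoint()     # P takes checkpoint 1 on its own
q.receive(p.send())      # Q is behind, so it takes a forced checkpoint
q.receive(p.send())      # same index again: no additional checkpoint
print(q.log)
```

Every forced entry in `log` counts toward the failure-free overhead the abstract refers to, which is why protocols in this area compete on how few such checkpoints they need.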

Keywords

fault tolerance; communication induced checkpointing; Z-cycle detection


Korean title: 통신 유형 변형을 이용하여 검사점 생성 개수를 개선한 검사점 Z-Cycle 검출 기법