
New Z-Cycle Detection Algorithm Using Communication Pattern Transformation for the Minimum Number of Forced Checkpoints  

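Woo Namyoon (School of Computer Engineering, Seoul National University)
Yeom Heon Young (School of Computer Engineering, Seoul National University)
Park Taesoon (School of Computer Engineering, Sejong University)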
Abstract
Communication-induced checkpointing (CIC) is a checkpointing technique that provides fault tolerance for distributed systems. Independent checkpoints, which each distributed process takes without coordination, are likely to be useless. A useless checkpoint, one that cannot belong to any consistent global checkpoint, induces nondeterministic rollback. To prevent useless checkpoints, CIC forces processes to take additional checkpoints at the proper moments. The number of these forced checkpoints is the main source of failure-free overhead in CIC. In this paper, we present two new CIC protocols that satisfy the 'No Z-Cycle (NZC)' property. The proposed protocols take fewer forced checkpoints than the existing protocols, at the cost of increased message delay. Our simulation results with synthetic data show that the proposed protocols have lower failure-free overhead than the existing protocols. Additionally, we show that the classical 'index-based checkpointing' protocols are inefficient in constructing consistent global cuts in distributed executions.
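To make the notion of a forced checkpoint concrete, the following is a minimal sketch of the classical index-based rule in the style of Briatico, Ciuffoletti, and Simoncini. It is not one of the two protocols proposed in this paper. The class and method names (`Process`, `basic_checkpoint`, `send`, `receive`) are hypothetical and chosen only for illustration. Each process keeps a checkpoint index, piggybacks it on every message, and takes a forced checkpoint before delivering a message that carries a larger index.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Illustrative process for an index-based CIC sketch (hypothetical names)."""
    pid: int
    index: int = 0  # current checkpoint index, acting as a logical clock
    # Each entry is (checkpoint index, kind); every process starts with an initial checkpoint.
    checkpoints: list = field(default_factory=lambda: [(0, "initial")])

    def basic_checkpoint(self) -> None:
        # A basic checkpoint is taken independently and advances the local index.
        self.index += 1
        self.checkpoints.append((self.index, "basic"))

    def send(self) -> int:
        # The current index is piggybacked on every outgoing message.
        return self.index

    def receive(self, piggybacked_index: int) -> bool:
        # If the sender is ahead, take a forced checkpoint BEFORE delivering
        # the message, and adopt the sender's index.
        if piggybacked_index > self.index:
            self.index = piggybacked_index
            self.checkpoints.append((self.index, "forced"))
            return True
        return False  # message delivered without a forced checkpoint


if __name__ == "__main__":
    p0, p1 = Process(0), Process(1)
    p0.basic_checkpoint()            # p0 advances to index 1
    forced = p1.receive(p0.send())   # p1 is behind, so it is forced to checkpoint
    print(forced, p1.checkpoints)
```

In this sketch every message from a process with a larger index triggers a forced checkpoint at the receiver, which illustrates why the count of forced checkpoints dominates the failure-free overhead discussed above.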
Keywords
fault tolerance; communication induced checkpointing; Z-cycle detection;