http://dx.doi.org/10.3745/KIPSTA.2003.10A.2.123

A Striped Checkpointing Scheme for the Cluster System with the Distributed RAID  

Chang, Yun-Seok (Department of Computer Engineering, Daejin University)
Abstract
This paper presents a new striped checkpointing scheme for serverless cluster computers, in which the local disks attached to the cluster nodes collectively form a distributed RAID with a single I/O space. Striping enables parallel I/O across the distributed disks, and staggering avoids a network bottleneck in the distributed RAID. We demonstrate how to reduce checkpointing overhead and increase availability by striping and staggering dynamically for communication-intensive applications. The Linpack HPC Benchmark and MPI programs are run under these checkpointing schemes for performance evaluation on a 16-node cluster system. The benchmark results show the benefits of the striped checkpointing scheme compared to existing schemes, and they are useful for designing an efficient checkpointing scheme for fast rollback recovery from any single-node failure in a cluster system.
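The abstract names two mechanisms, striping and staggering, without giving their algorithms. The following is only a minimal sketch of the general idea, assuming round-robin block placement across disks and fixed-size groups of nodes that write in successive time slots. The function names, the block size, and the group size are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def stripe_blocks(image: bytes, n_disks: int, block_size: int) -> Dict[int, List[bytes]]:
    """Split one checkpoint image into fixed-size blocks and place them
    round-robin over the disks (assumed placement policy)."""
    placement: Dict[int, List[bytes]] = {d: [] for d in range(n_disks)}
    for offset in range(0, len(image), block_size):
        disk = (offset // block_size) % n_disks
        placement[disk].append(image[offset:offset + block_size])
    return placement


def reassemble(placement: Dict[int, List[bytes]], n_disks: int) -> bytes:
    """Rebuild the image for rollback by reading blocks back in the same
    round-robin order used by stripe_blocks."""
    depth = max(len(blocks) for blocks in placement.values())
    out: List[bytes] = []
    for row in range(depth):
        for disk in range(n_disks):
            if row < len(placement[disk]):
                out.append(placement[disk][row])
    return b"".join(out)


def stagger_schedule(n_nodes: int, group_size: int) -> List[List[int]]:
    """Partition node ids into groups; one group writes per time slot, so
    at most group_size nodes load the network at once (assumed policy)."""
    return [list(range(start, min(start + group_size, n_nodes)))
            for start in range(0, n_nodes, group_size)]


if __name__ == "__main__":
    image = bytes(range(10))
    placement = stripe_blocks(image, n_disks=4, block_size=3)
    assert reassemble(placement, n_disks=4) == image
    for slot, nodes in enumerate(stagger_schedule(n_nodes=16, group_size=4)):
        print(f"slot {slot}: nodes {nodes}")
```

In this sketch, a smaller group size lowers the number of nodes writing concurrently at the cost of more time slots per checkpoint; how the paper actually chooses this trade-off dynamically is not stated in the abstract.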
Keywords
Striped Checkpointing Scheme; Checkpoint; Cluster Computer; Distributed RAID