
Design and Implementation of a User-based MPI Checkpointer for Portability  

Ahn Sun-Il (Korea Institute of Science and Technology Information)
Han Sang-Yong (School of Computer Science and Engineering, Seoul National University)
Abstract
An MPI checkpointer is a tool that provides fault tolerance through checkpointing. Previous research on MPI checkpointers has focused on automatic checkpointing and recovery but has not considered portability. In this paper, we discuss the design and implementation issues we addressed for portability while developing an MPI checkpointer called STFT. To increase portability, STFT first provides an abstraction interface for the single-process checkpointer. Second, STFT uses a user-based checkpointing method and restricts the places where a user can take a checkpoint. Third, STFT has MPI_Init create network connections to the other MPI processes in a fixed order. With these features, we expect STFT to be easily adaptable to various platforms and MPI implementations, and we confirmed with a prototype implementation that it adapts easily to LAM and MPICH/P4.
Keywords
MPI; checkpointing; portability;