References
- L. Bautista-Gomez, T. Ropars, N. Maruyama and S. Matsuoka, "Hierarchical clustering strategies for fault tolerance in large scale HPC systems", In Proc. of the IEEE International Conference on Cluster Computing, pp. 355-363, 2012.
- W. Bland, A. Bouteiller, T. Herault, G. Bosilca and J. J. Dongarra, "Post-failure recovery of MPI communication capability: design and rationale", The International Journal of High Performance Computing Applications, Vol. 27, No. 3, pp. 244-254, 2013. https://doi.org/10.1177/1094342013488238
- R. D. Schlichting and F. B. Schneider, "Fail-stop processors: an approach to designing fault-tolerant distributed computing systems", ACM Transactions on Computer Systems, Vol. 1, No. 3, pp. 222-238, 1983. https://doi.org/10.1145/357369.357371
- S. Di, L. Bautista-Gomez and F. Cappello, "Optimization of multi-level checkpoint model with uncertain execution scales", In Proc. of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 907-918, 2014.
- T. Ropars and C. Morin, "Active optimistic and distributed message logging for message-passing applications", Concurrency and Computation: Practice and Experience, Vol. 23, No. 17, pp. 2167-2178, 2011. https://doi.org/10.1002/cpe.1775
- L. Lamport, "Time, clocks, and the ordering of events in a distributed system", Communications of the ACM, Vol. 21, No. 7, pp. 558-565, 1978. https://doi.org/10.1145/359545.359563
- M. G. Estahbanati and F. Schintke, "Multilevel Checkpoint/Restart for Large Computational Jobs on Distributed Computing Resources", In Proc. of the 38th Symposium on Reliable Distributed Systems (SRDS), pp. 143-152, 2019.
- H. Mansouri and A. Pathan, "Checkpointing distributed computing systems: an optimisation approach", International Journal of High Performance Computing and Networking, Vol. 15, No. 3/4, pp. 202-209, 2019. https://doi.org/10.1504/IJHPCN.2019.106109
- E. Elnozahy, L. Alvisi, Y. Wang and D. Johnson, "A survey of rollback-recovery protocols in message-passing systems", ACM Computing Surveys, Vol. 34, No. 3, pp. 375-408, 2002. https://doi.org/10.1145/568522.568525
- K. M. Chandy and L. Lamport, "Distributed snapshots: determining global states of distributed systems", ACM Transactions on Computer Systems, Vol. 3, No. 1, pp. 63-75, 1985. https://doi.org/10.1145/214451.214456
- I. C. Garcia, G. M. D. Vieira and L. E. Buzato, "A Rollback in the History of Communication-Induced Checkpointing", submitted (arXiv:1702.06167 [cs.DC]), Feb. 2017.
- J.-M. Helary, A. Mostefaoui, R. H. B. Netzer and M. Raynal, "Communication-based prevention of useless checkpoints in distributed computations", Distributed Computing, Vol. 13, No. 1, pp. 29-43, 2000. https://doi.org/10.1007/s004460050003
- Y. Luo and D. Manivannan, "FINE: a Fully Informed aNd Efficient communication-induced checkpointing protocol for distributed systems", Journal of Parallel and Distributed Computing, Vol. 69, No. 2, pp. 153-167, 2009. https://doi.org/10.1016/j.jpdc.2008.07.012
- C. Simon, A. Calixto, S. E. P. Hernandez and J. R. Perez Cruz, "A delayed checkpoint approach for communication-induced checkpointing in autonomic computing", In Proc. of the 22nd IEEE International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), pp. 56-61, 2013.
- J. Tsai, "Applying the Fully-Informed Checkpointing Protocol to the Lazy Indexing Strategy", Journal of Information Science and Engineering, Vol. 23, No. 5, pp. 1611-1621, 2007.
- G. M. Vieira, I. C. Garcia and L. E. Buzato, "Systematic Analysis of Index-Based Checkpointing Algorithms using Simulation", In SCTF '01: Proc. of the IX Brazilian Symposium on Fault-Tolerant Computing, Florianópolis, Santa Catarina, Brazil, pp. 31-42, 2001.
- R. H. B. Netzer and J. Xu, "Necessary and Sufficient Conditions for Consistent Global Snapshots", IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 2, pp. 165-169, 1995. https://doi.org/10.1109/71.342127
- R. Bagrodia, R. Meyer, M. Takai, Y. Chen, X. Zeng, J. Martin and H. Y. Song, "Parsec: a parallel simulation environment for complex systems", IEEE Computer, Vol. 31, No. 10, pp. 77-85, 1998.