Reducing Overhead of Distributed Checkpointing with Group Communication

  • Ahn, Jinho (School of Computer Science and Engineering, Kyonggi University)
  • Received : 2020.12.12
  • Accepted : 2020.12.30
  • Published : 2020.12.31

Abstract

The HMNR protocol was proposed to minimize the number of forced checkpoints by using control information about every other process, piggybacked on each sent message. An improved protocol, called Lazy-HMNR, was then presented to lower the likelihood of forced checkpoints caused by the asymmetry between the checkpointing frequencies of processes. Despite these two minimization techniques, under heavy message traffic the shortcomings of Lazy-HMNR may considerably lower the probability of determining that no Z-cycle exists. In addition, no previous work provides procedures that exploit the network infrastructure to substantially reduce the number of forced checkpoints using the dependency information carried on every application message. We introduce a novel Lazy-HMNR protocol for group communication-based distributed computing systems that reduces the number of forced checkpoints more effectively. Our simulation results show that the proposed protocol can substantially reduce the frequency of forced checkpoints compared with Lazy-HMNR.
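To make the notions of piggybacked control information and forced checkpoints concrete, the sketch below shows a much simpler index-based rule than HMNR or Lazy-HMNR. It is not the protocol proposed here. Each process piggybacks a single logical checkpoint index on its messages, and a receiver that is behind the sender takes a forced checkpoint before delivery. The names `Process`, `basic_checkpoint`, `send`, `receive`, and `demo` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """One process under a simplified index-based rule (assumption: not HMNR)."""
    pid: int
    index: int = 0   # logical checkpoint index, piggybacked on every message
    basic: int = 0   # checkpoints taken on the process's own schedule
    forced: int = 0  # checkpoints forced by an incoming message

    def basic_checkpoint(self) -> None:
        """Take a checkpoint independently and advance the local index."""
        self.index += 1
        self.basic += 1

    def send(self) -> dict:
        """Build a message that carries the sender's current index."""
        return {"src": self.pid, "index": self.index}

    def receive(self, msg: dict) -> bool:
        """Deliver msg, taking a forced checkpoint first if the sender is ahead.

        Returns True when a forced checkpoint was taken.
        """
        if msg["index"] > self.index:
            self.index = msg["index"]
            self.forced += 1
            return True
        return False


def demo():
    """Asymmetric checkpointing: p checkpoints often, q never does on its own."""
    p, q = Process(0), Process(1)
    for _ in range(3):
        p.basic_checkpoint()
    first = q.receive(p.send())   # q is behind p, so the message forces a checkpoint
    second = q.receive(p.send())  # q has caught up, so no checkpoint is forced
    return first, second, q.index, q.forced
```

In this toy run the slower process `q` is forced to checkpoint only because `p` checkpoints more frequently, which is the kind of asymmetry-induced overhead the abstract refers to. Protocols such as HMNR piggyback richer dependency information so that fewer of these forced checkpoints are needed.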
