Reducing Overhead of Distributed Checkpointing with Group Communication

Ahn, Jinho

doi:10.14801/JAITC.2020.10.2.83

Journal of Advanced Information Technology and Convergence (한국정보기술학회 영문논문지)

Volume 10, Issue 2 / Pages 83-90 / 2020 / 2234-1072 (pISSN) / 2234-0963 (eISSN)

Korean Institute of Information Technology (한국정보기술학회)

Ahn, Jinho (School of Computer Science and Engineering, Kyonggi University)

Received : 2020.12.12
Accepted : 2020.12.30
Published : 2020.12.31

https://doi.org/10.14801/JAITC.2020.10.2.83

Abstract

The HMNR protocol was proposed to minimize the number of forced checkpoints by using control information about every other process, piggybacked on each sent message. An improved protocol, called Lazy-HMNR, was then presented to lower the likelihood of forced checkpoints caused by the asymmetry between the checkpointing frequencies of processes. Despite these two minimization techniques, under heavy message traffic the shortcomings of Lazy-HMNR may make it considerably less likely to determine that no Z-cycle has occurred. We also observe that no previous work provides procedures that exploit the network infrastructure to substantially decrease the number of forced checkpoints using the dependency information carried on every application message. We introduce a novel Lazy-HMNR protocol for group communication-based distributed computing systems that reduces the number of forced checkpoints more effectively. Our simulation results show that the proposed protocol can substantially reduce the frequency of forced checkpoints compared with Lazy-HMNR.
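To give a rough feel for what "control information piggybacked on each sent message" and "forced checkpoints" mean, the sketch below implements the simplest index-based communication-induced checkpointing rule (in the style of Briatico, Ciuffoletti, and Simoncini). It is not HMNR, Lazy-HMNR, or the protocol proposed in this paper; those piggyback richer dependency information precisely so that fewer forced checkpoints are needed. The class, field, and method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical process model for a simple index-based CIC rule."""
    pid: int
    index: int = 0    # sequence number of the latest local checkpoint
    basic: int = 0    # checkpoints taken on the process's own schedule
    forced: int = 0   # checkpoints forced by incoming messages

    def basic_checkpoint(self) -> None:
        # Independent (basic) checkpoint: advance the local index.
        self.index += 1
        self.basic += 1

    def send(self, payload: str) -> dict:
        # Piggyback the current checkpoint index on the application message.
        return {"src": self.pid, "index": self.index, "payload": payload}

    def receive(self, msg: dict) -> str:
        # If the sender is ahead, take a forced checkpoint BEFORE delivery
        # and adopt the sender's index. This simple rule is an assumption
        # of the sketch; HMNR-style protocols test finer-grained conditions.
        if msg["index"] > self.index:
            self.index = msg["index"]
            self.forced += 1
        return msg["payload"]


# Small scenario: p0 checkpoints more often than p1 (asymmetric frequencies).
p0, p1 = Process(0), Process(1)
p0.basic_checkpoint()            # p0.index becomes 1
p1.receive(p0.send("a"))         # p1 is behind, so it takes a forced checkpoint
p0.receive(p1.send("b"))         # indices are equal, so no forced checkpoint
```

In this scenario the process that checkpoints less often is the one that pays for the forced checkpoint, which is the kind of asymmetry the abstract attributes to differing checkpointing frequencies.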

