Scalable Approach to Failure Analysis of High-Performance Computing Systems

Shawky, Doaa

doi:10.4218/etrij.14.0113.1133

ETRI Journal

Volume 36, Issue 6 / Pages 1023-1031 / 2014 / 1225-6463 (pISSN) / 2233-7326 (eISSN)

Electronics and Telecommunications Research Institute (한국전자통신연구원)

Shawky, Doaa (Department of Engineering Mathematics, Cairo University)

Received : 2013.11.12
Accepted : 2014.05.07
Published : 2014.12.01

https://doi.org/10.4218/etrij.14.0113.1133

Abstract

Failure analysis is necessary to clarify the root cause of a failure, predict the next time a failure may occur, and improve the performance and reliability of a system. However, it is not an easy task to analyze and interpret failure data, especially for complex systems. Usually, these data are represented using many attributes, and sometimes they are inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high-performance computing systems. The approach employs rough sets theory (RST) for this task. The application of RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. In addition, it is used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
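
The abstract does not spell out the computation, so the following is only a minimal sketch of the standard rough-set notions it alludes to: indiscernibility classes, lower and upper approximations, and the dependency degree of a decision attribute on a set of condition attributes. The toy records, attribute names (`workload`, `repair`, `root_cause`), and function names are hypothetical illustrations. They are not drawn from the Los Alamos failure data and are not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical toy failure records (NOT taken from the LANL data set).
RECORDS = [
    {"workload": "compute",  "repair": "short", "root_cause": "hardware"},
    {"workload": "compute",  "repair": "long",  "root_cause": "hardware"},
    {"workload": "graphics", "repair": "short", "root_cause": "software"},
    {"workload": "graphics", "repair": "short", "root_cause": "software"},
    {"workload": "frontend", "repair": "long",  "root_cause": "network"},
    {"workload": "frontend", "repair": "long",  "root_cause": "software"},
]


def partition(records, attrs):
    """Group record indices into indiscernibility classes over attrs."""
    blocks = defaultdict(set)
    for i, rec in enumerate(records):
        blocks[tuple(rec[a] for a in attrs)].add(i)
    return list(blocks.values())


def lower_approximation(records, attrs, target):
    """Union of the classes that lie entirely inside the target set."""
    return set().union(*(b for b in partition(records, attrs) if b <= target))


def upper_approximation(records, attrs, target):
    """Union of the classes that overlap the target set at all."""
    return set().union(*(b for b in partition(records, attrs) if b & target))


def dependency_degree(records, cond_attrs, decision_attr):
    """Fraction of records whose decision value is determined by cond_attrs."""
    positive = set()
    for decision_class in partition(records, [decision_attr]):
        positive |= lower_approximation(records, cond_attrs, decision_class)
    return len(positive) / len(records)


if __name__ == "__main__":
    for attrs in (["workload", "repair"], ["workload"], ["repair"]):
        print(attrs, dependency_degree(RECORDS, attrs, "root_cause"))
```

In this toy table, `workload` alone yields the same dependency degree as `workload` and `repair` together, while `repair` alone determines nothing. That is the kind of attribute dependency the abstract says the approach reveals. Records 4 and 5 are deliberately inconsistent (same condition values, different root causes), which is why the lower and upper approximations of the "software" class differ.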
