Scalable Approach to Failure Analysis of High-Performance Computing Systems

  • Shawky, Doaa (Department of Engineering Mathematics, Cairo University)
  • Received: 2013.11.12
  • Accepted: 2014.05.07
  • Published: 2014.12.01

Abstract

Failure analysis is needed to identify the root cause of a failure, predict when the next failure may occur, and improve the performance and reliability of a system. However, analyzing and interpreting failure data is difficult, especially for complex systems: the data are usually described by many attributes and are sometimes inconsistent and ambiguous. In this paper, we present a scalable approach to the analysis and interpretation of failure data from high-performance computing systems, based on rough sets theory (RST). Applying RST to a large, publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. RST is also used to analyze other failure characteristics, such as time between failures, repair times, the workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
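
As a rough illustration of how RST can expose which attributes determine a root cause, the sketch below computes the standard rough-set dependency degree: the fraction of records whose condition-attribute values map to exactly one decision value (the positive region). This is a minimal sketch and not the paper's implementation. The attribute names (workload, repair, node_type, root_cause), the function names, and the six toy records are all hypothetical and are not taken from the LANL data.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into indiscernibility classes over the given attributes."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())


def dependency_degree(rows, condition_attrs, decision_attr):
    """Return |positive region| / |rows| for the decision attribute.

    A class belongs to the positive region when all of its rows
    share a single value of the decision attribute.
    """
    positive = 0
    for block in partition(rows, condition_attrs):
        decisions = {rows[i][decision_attr] for i in block}
        if len(decisions) == 1:
            positive += len(block)
    return positive / len(rows)


# Hypothetical toy failure records (illustrative only, not real data).
FAILURES = [
    {"workload": "compute", "repair": "short", "node_type": "A", "root_cause": "software"},
    {"workload": "compute", "repair": "long", "node_type": "A", "root_cause": "hardware"},
    {"workload": "graphics", "repair": "short", "node_type": "B", "root_cause": "software"},
    {"workload": "graphics", "repair": "long", "node_type": "B", "root_cause": "hardware"},
    {"workload": "compute", "repair": "long", "node_type": "B", "root_cause": "network"},
    {"workload": "compute", "repair": "long", "node_type": "B", "root_cause": "hardware"},
]

if __name__ == "__main__":
    all_attrs = ["workload", "repair", "node_type"]
    print(dependency_degree(FAILURES, all_attrs, "root_cause"))
    # Dropping one attribute at a time shows how much each one contributes.
    for attr in all_attrs:
        rest = [a for a in all_attrs if a != attr]
        print(attr, dependency_degree(FAILURES, rest, "root_cause"))
```

The last two toy records have identical condition values but different root causes, which is the kind of inconsistency the abstract mentions. They fall outside the positive region, so the dependency degree stays below 1 even when all attributes are used. An attribute whose removal lowers the degree is a candidate for the main attributes behind the root cause.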
