Improving the Job Success Rate through Analysis of User Logs in HPC

Improving the Job Success Rate through User Log Analysis in an HPC Environment

  • Received : 2015.09.18
  • Accepted : 2015.10.22
  • Published : 2015.10.31

Abstract

Supercomputers are used across many fields, from numerical computation in advanced science and technology, which demands large amounts of computation, to the design and development of new industrial products. Tachyon, KISTI's fourth supercomputer, is a massively parallel computing system built on SUN Blade servers and consists of 3,200 computing nodes plus infrastructure nodes. The system currently serves about 10,000 users from more than 170 organizations, and their many jobs run in batch mode through a scheduler. Tachyon also stores the job scripts, execution environment, libraries, and job execution logs associated with each job, from submission to completion. In this paper, we analyze batch job information from Sun Grid Engine, which is used as the scheduler on Tachyon, together with Tachyon's job execution logs. In particular, we separate failed jobs from all the jobs that users ran and analyze the causes of failure. By improving some of the jobs logged as failed, we can extract jobs that can be regarded as normal, which improves the overall success rate of the system.
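The abstract does not say which log fields the analysis relies on. As a rough illustration only, the sketch below assumes the colon-separated Sun Grid Engine accounting record layout (with `failed` and `exit_status` as the 12th and 13th fields) and an assumed rule that a job counts as successful when both are zero. The function names, the outcome labels, and the sample records are hypothetical and are not taken from the paper or from Tachyon logs.

```python
from collections import Counter

# Zero-based positions of the fields used here in a colon-separated
# Sun Grid Engine accounting record (assumed layout: qname, hostname, group,
# owner, job_name, job_number, account, priority, submission_time,
# start_time, end_time, failed, exit_status, ru_wallclock, ...).
OWNER, JOB_NUMBER, FAILED, EXIT_STATUS = 3, 5, 11, 12


def parse_record(line):
    """Return (owner, job_number, failed, exit_status), or None to skip the line."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    fields = line.split(":")
    if len(fields) <= EXIT_STATUS:
        return None  # too short to hold the fields we need
    return (fields[OWNER], fields[JOB_NUMBER],
            int(fields[FAILED]), int(fields[EXIT_STATUS]))


def summarize(lines):
    """Count jobs per outcome and return (counts, success_rate)."""
    counts = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        _, _, failed, exit_status = rec
        if failed != 0:
            counts["scheduler_failure"] += 1   # failure reported by the scheduler
        elif exit_status != 0:
            counts["nonzero_exit"] += 1        # job ran but exited with an error
        else:
            counts["success"] += 1
    total = sum(counts.values())
    rate = counts["success"] / total if total else 0.0
    return counts, rate


# Hypothetical records, truncated after ru_wallclock.
SAMPLE = [
    "# comment line",
    "normal:node0001:grp:alice:run_a:101:sge:0:1380000000:1380000010:1380003610:0:0:3600",
    "normal:node0002:grp:bob:run_b:102:sge:0:1380000100:1380000110:1380000120:0:1:10",
    "long:node0003:grp:carol:run_c:103:sge:0:1380000200:1380000210:1380000215:100:137:5",
    "normal:node0004:grp:alice:run_d:104:sge:0:1380000300:1380000310:1380007510:0:0:7200",
]

counts, rate = summarize(SAMPLE)
print(dict(counts), f"success rate = {rate:.2f}")
```

Splitting the failures into two buckets mirrors the idea of reviewing jobs that were logged as failed: a non-zero exit status alone does not necessarily mean the computation was unusable, so such jobs are candidates for reclassification. A real analysis would also need to handle job names that contain colons, which this sketch does not.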
