[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.9728/dcs.2015.16.5.691

Improving the Job Success Rate through Analysis of User Logs in HPC

Yoon, JunWeon (Dept. of Supercomputing Center, KISTI)
Hong, TaeYoung (Dept. of Supercomputing Center, KISTI)
Kong, Ki-Sik (Dept. of Multimedia, Namseoul University)
Park, ChanYeol (Dept. of Supercomputing Center, KISTI)

Publication Information

Journal of Digital Contents Society / v.16, no.5, 2015, pp. 691-697

Abstract

Supercomputers serve large-scale computational needs in many areas, from state-of-the-art science and technology to new product design in industry. Tachyon, the fourth supercomputer built at KISTI, is a high-performance parallel computing system with 3,200 computing nodes and supporting infrastructure. The system currently serves about 10,000 users from more than 170 organizations, who run their work as batch jobs through a scheduler. It also logs a large amount of information about each job from submission to completion, including job scripts, execution environment, libraries, and job status. In this paper, we analyze batch job information from Sun Grid Engine, which is used as the scheduler in the Tachyon system, together with job execution information from the Tachyon system. In particular, we separate the failed jobs from all the jobs that users run and analyze the causes of failure. Among the jobs logged as failed, we were able to extract some that can, through improvement, be regarded as normal jobs.
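
As a rough illustration of the kind of analysis the abstract describes, the sketch below separates failed jobs from successful ones in Sun Grid Engine accounting records. It is not the authors' method. It assumes the colon-separated field order of the Grid Engine accounting file (owner, job number, `failed`, and `exit_status` at positions 3, 5, 11, and 12), and the function names, labels, and sample records are hypothetical.

```python
from collections import Counter

# Field positions assumed from the Grid Engine accounting file layout
# (colon-separated): owner, job_number, failed, exit_status.
OWNER, JOB_NUMBER, FAILED, EXIT_STATUS = 3, 5, 11, 12


def classify(line):
    """Return (owner, job_number, label) for one accounting record."""
    fields = line.rstrip("\n").split(":")
    failed = int(fields[FAILED])
    exit_status = int(fields[EXIT_STATUS])
    if failed != 0:
        label = "scheduler_failure"  # the scheduler reported a failure code
    elif exit_status != 0:
        label = "nonzero_exit"       # the job ran but exited with an error
    else:
        label = "success"
    return fields[OWNER], fields[JOB_NUMBER], label


def summarize(lines):
    """Count labels over all records, skipping comments and blank lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[classify(line)[2]] += 1
    return counts


# Made-up sample records (not Tachyon data), truncated after ru_wallclock.
sample = [
    "all.q:node001:grp:alice:sim_a:101:sge:0:1400000000:1400000010:1400003610:0:0:3600",
    "all.q:node002:grp:bob:sim_b:102:sge:0:1400000100:1400000110:1400000120:0:1:10",
    "all.q:node003:grp:carol:sim_c:103:sge:0:1400000200:0:0:26:0:0",
]
print(summarize(sample))
```

Keeping scheduler-reported failures apart from non-zero exit codes is one possible way to start looking for jobs that are logged as failed but may in fact have completed their work.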

Keywords

HPC; Supercomputer; Scheduler; Batch job; Log Analysis;

Citations & Related Records

Reference

1	National Institute of Supercomputing and Networking, KISTI, http://www.nisn.re.kr
2	F. Wang, S. Oral, G. Shipman, O. Drokin, T. Wang, and I. Huang, "Understanding Lustre Filesystem Internals", Oak Ridge National Lab, Technical Report ORNL/TM-2009/117, 2009.
3	G. Pfister, "An Introduction to the InfiniBand Architecture (http://www.infinibandta.org/)", IEEE Press, 2001.
4	G. Cawood, T. Seed, R. Abrol, T. Sloan, "TOG & JOSH: Grid Scheduling with Grid Engine & Globus", Proceedings of the UK e-Science All Hands Meetings, Nottingham, 2004.
5	Templeton, D., "A Beginner's Guide to Sun Grid Engine 6.2", Whitepaper of Sun Microsystems, July 2009.
6	Stillwell, M.; Vivien, F.; Casanova, H., "Dynamic Fractional Resource Scheduling versus Batch Scheduling", IEEE Transactions on Parallel and Distributed Systems, vol.23, no.3, pp.521-529, March 2012.
7	C. Chaubal, "Scheduler Policies for Job Prioritization in the Sun N1 Grid Engine 6 System", Technical report, Sun BluePrints Online, Sun Microsystems, Inc., Santa Clara, CA, USA. http://www.sun.com/blueprints/1005/819-4325.pdf, 2005.
8	J.H. Abawajy, "An efficient adaptive scheduling policy for high-performance computing", Future Generation Computer Systems, Volume 25, Issue 3, pp.364-370, Mar 2009.
9	J. W. Yoon, T. Y. Hong, C. Y. Park, H.C. Yu, "Analysis of Batch Job log to improve the success rate in HPC Environment", International Conference on Convergence Technology, vol.2, no.1, pp.209-210, July 2013.
10	El-Sayed, N., & Schroeder, B., "Reading between the lines of failure logs: Understanding how HPC systems fail". In: Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on, pp.1-12, June 2013.

8	(2018) Journal of Environmental Science International Optimization of the computing environment to improve the speed of the modeling (WRF and CMAQ) calculation of the National Air Quality Forecast System / 27 (8) , 723
7	(2015) Applied sciences Log Analysis-Based Resource and Execution Time Improvement in HPC: A Case Study / 10 (7) , 2634
