http://dx.doi.org/10.9728/dcs.2015.16.5.691

Improving the Job Success Rate through Analysis of User Logs in HPC  

Yoon, JunWeon (Dept. of Supercomputing Center, KISTI)
Hong, TaeYoung (Dept. of Supercomputing Center, KISTI)
Kong, Ki-Sik (Dept. of Multimedia, Namseoul University)
Park, ChanYeol (Dept. of Supercomputing Center, KISTI)
Publication Information
Journal of Digital Contents Society / v.16, no.5, 2015, pp. 691-697
Abstract
Supercomputers are used in many areas, from new product design in industry to state-of-the-art science and technology, wherever large amounts of computation are needed. Tachyon, the fourth supercomputer built at KISTI, is a high-performance parallel computing system with 3,200 computing nodes and supporting infrastructure. The system currently serves about 10,000 users from more than 170 organizations, who run their jobs in batch form through a scheduler. The system also logs a large amount of information about each job, including job scripts, execution environment, libraries, and job status, from submission to completion. In this paper, we analyze batch job information from Sun Grid Engine, which is used as the scheduler on the Tachyon system, together with job execution information from the Tachyon system. In particular, we separate failed jobs from all the tasks that users perform and analyze the causes of failure. Among the jobs logged as failed, we extract some that, with improvement, can be regarded as normal jobs.
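As a rough illustration of the kind of log analysis the abstract describes, the sketch below reads Sun Grid Engine accounting records (colon-separated lines, with the leading field order taken from the `accounting(5)` man page) and tallies jobs by outcome. The three outcome labels and the rule that separates them are assumptions made here for illustration; they are not the failure categories used in the paper, and the sample records in the usage note are made up.

```python
from collections import Counter

# Leading fields of an SGE accounting record, in the order given by accounting(5).
FIELDS = [
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]


def parse_record(line):
    """Split one accounting line into a dict of its leading fields."""
    parts = line.rstrip("\n").split(":")
    return dict(zip(FIELDS, parts))


def classify(record):
    """Hypothetical three-way split; not the paper's own categories."""
    failed = int(record["failed"])
    exit_status = int(record["exit_status"])
    if failed == 0 and exit_status == 0:
        return "success"
    if failed != 0:
        # The scheduler itself reported a failure code for the job.
        return "scheduler_failure"
    # The scheduler ran the job, but the user's program exited non-zero.
    return "nonzero_exit"


def summarize(lines):
    """Count jobs per outcome, skipping blank lines and '#' comments."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[classify(parse_record(line))] += 1
    return counts
```

For example, `summarize(open(path))` over an accounting file at some site-specific `path` returns a `Counter` keyed by the three labels. Jobs in the `nonzero_exit` group would be one plausible place to look for jobs that were logged as failed but ran normally, though that reading is an assumption rather than a finding from the paper.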
Keywords
HPC; Supercomputer; Scheduler; Batch job; Log analysis