CTS: A Cluster System Test Suite for Preventive Maintenance  

차광호 (Korea Institute of Science and Technology Information)
Abstract
Cluster systems are widely used to solve problems in many application domains and are regarded as useful high-performance computing resources. As the number of cluster system users grows, maintaining stable operation is no less important than improving performance. Although hardware preventive maintenance is important for keeping a system operating normally, testing tools that can be used on general cluster systems during maintenance have received little attention. In this paper, we propose a hardware testing tool for cluster systems designed with preventive maintenance in mind. The tool, named CTS (Cluster system Test Suite), has two check routines: one for memory and one for the network interface card (NIC). CTS is designed to support the common features of general cluster systems, and every task, from setting test conditions to querying the results, can be done within an integrated GUI environment. CTS was used as the testing tool for two kinds of cluster systems during maintenance, and the experimental results show that it reports information useful for cluster system management.
Keywords
Cluster System; Preventive Maintenance; Hardware Test;