CTS: A Cluster System Test Suite for Preventive Maintenance

CTS: 예방 정비를 위한 클러스터 시스템 검사 도구

  • Published : 2004.10.01

Abstract

Cluster systems are widely used to solve problems in various application domains and are regarded as useful high-performance computing resources. As the number of cluster system users increases, maintaining stable operation is no less important than improving cluster system performance. Although hardware preventive maintenance is important for keeping a system in normal operation, testing tools that can be used on general cluster systems during maintenance have received little attention. In this paper, with hardware preventive maintenance in mind, we propose a hardware testing tool for cluster systems. The tool, named CTS (Cluster system Test Suite), has two check routines: one for memory and one for the NIC. CTS is designed to support the features common to general cluster systems, and every task, from setting test conditions to querying the results, can be done entirely within an integrated GUI environment. CTS was used as the testing tool for two kinds of cluster systems during maintenance, and the experimental results show that it reports information useful for cluster system management.
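The abstract does not describe how the CTS check routines work internally, so the following is only a minimal sketch of what a pattern-based memory check might look like, in the general spirit of user-space memory testers. The function names `check_memory` and `summarize`, the pattern set, and the result fields are all hypothetical and are not taken from CTS. A user-space buffer test like this exercises only the memory the operating system happens to hand out, so it illustrates the idea rather than a real diagnostic.

```python
def check_memory(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each byte pattern across a buffer, read it back,
    and return the sorted offsets that did not hold the pattern.

    Hypothetical illustration only; not the CTS memory routine.
    """
    buf = bytearray(size_bytes)
    bad_offsets = set()
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected              # write phase
        if buf != expected:            # read-back phase
            bad_offsets.update(
                i for i, value in enumerate(buf) if value != pattern
            )
    return sorted(bad_offsets)


def summarize(node, bad_offsets):
    """Build a per-node result record (field names are assumptions)."""
    return {
        "node": node,
        "test": "memory",
        "status": "FAIL" if bad_offsets else "PASS",
        "bad_offset_count": len(bad_offsets),
    }


if __name__ == "__main__":
    # Test 1 MiB on a hypothetical node called "node01".
    print(summarize("node01", check_memory(1024 * 1024)))
```

A record of this shape could be collected from each node and stored centrally so that results can be queried later, which is the kind of workflow the abstract attributes to the integrated GUI.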
