HPC Cluster-based Customized Container Image Manager and Builder

  • Gukhua Lee (Korea Supercomputing Infrastructure Center, Korea Institute of Science and Technology Information)
  • Joon Woo (Korea Supercomputing Infrastructure Center, Korea Institute of Science and Technology Information)
  • Taeyoung Hong (Korea Supercomputing Infrastructure Center, Korea Institute of Science and Technology Information)
  • Received: 2024.05.14
  • Accepted: 2024.09.26
  • Published: 2024.10.31

Abstract

This paper introduces a novel approach for managing and building customized container images in high-performance computing (HPC) environments, addressing the growing need for flexibility, scalability, and efficiency in computational workflows. Our contributions include the development and integration of a custom container image manager and builder within a container-based HPC infrastructure. This system enables users to effortlessly create, manage, and deploy personalized AI service platforms, significantly enhancing the user experience by reducing the time and effort required to configure essential packages and frameworks. The image manager we developed is capable of processing multiple user requests concurrently, distributing tasks efficiently to image builders operating on compute nodes. Meanwhile, the image builder is designed to handle queued tasks, generate customized container images based on active instances, and store these images in a private container registry, ensuring seamless access and reusability. We validated our system's effectiveness by implementing it on HPC cluster-based systems, including the Nurion supercomputer and the Neuron GPU system, demonstrating its scalability and interoperability in real-world environments. Additionally, we established an architecture and mechanism that ensures seamless integration with existing container-based supercomputing frameworks, underscoring our system's capability to optimize resource utilization and streamline the deployment of AI service platforms.
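The division of labor described in the abstract (a manager that accepts concurrent user requests, and builders on compute nodes that work through queued tasks and store the results in a private registry) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name in it (BuildRequest, ImageManager, InMemoryRegistry, image_builder, run_demo, the node names) is hypothetical, the registry is an in-memory stand-in, and the step that would snapshot a running instance is replaced by simply deriving an image name.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRequest:
    """One user's request to snapshot a running instance (hypothetical schema)."""
    user: str
    instance_id: str
    tag: str


class InMemoryRegistry:
    """Stand-in for the private container registry; it only records image names."""

    def __init__(self):
        self._images = {}
        self._lock = threading.Lock()

    def push(self, image_name, built_on):
        with self._lock:
            self._images[image_name] = built_on

    def images(self):
        with self._lock:
            return dict(self._images)


class ImageManager:
    """Accepts requests from many users and queues them for the builders."""

    def __init__(self):
        self.tasks = queue.Queue()

    def submit(self, request):
        self.tasks.put(request)


def image_builder(node, tasks, registry):
    """Process queued tasks on one compute node until a None sentinel arrives."""
    while True:
        request = tasks.get()
        if request is None:
            tasks.task_done()
            return
        # Assumption: a real builder would snapshot the active instance here.
        # The sketch only derives the image name and records it.
        image_name = f"{request.user}/{request.instance_id}:{request.tag}"
        registry.push(image_name, built_on=node)
        tasks.task_done()


def run_demo(requests, nodes=("node01", "node02")):
    """Run one builder thread per (hypothetical) node and return the registry contents."""
    manager = ImageManager()
    registry = InMemoryRegistry()
    workers = [
        threading.Thread(target=image_builder, args=(node, manager.tasks, registry))
        for node in nodes
    ]
    for worker in workers:
        worker.start()
    for request in requests:
        manager.submit(request)
    for _ in workers:
        manager.submit(None)  # one sentinel per builder, queued after all requests
    manager.tasks.join()
    for worker in workers:
        worker.join()
    return registry.images()
```

Calling run_demo with a few BuildRequest objects returns a mapping from image name to the node that built it. Which node handles which request depends on thread scheduling, so only the set of image names is deterministic. A shared queue is used here because it lets idle builders pull work as soon as they are free; the paper's actual distribution mechanism may differ.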

Acknowledgement

This research was carried out under Project No. K24L2M1C1 (The national flagship supercomputer infrastructure implementation and service), supported by the Korea Institute of Science and Technology Information (KISTI).
