• Title/Abstract/Keywords: open MPI

39 search results, processing time 0.032 seconds

Performance Analysis of the Parallel CUPID Code for Various Parallel Programming Models in Symmetric Multi-Processing System

  • 전병진;이재룡;윤한영;최형권
    • 대한기계학회논문집B / Vol. 38, No. 1 / pp.71-79 / 2014
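  • This study examines, on an SMP (Symmetric Multi-Processing) system, the parallelization of the Bi-Conjugate Gradient algorithm used to solve the pressure field in the CUPID (Component Unstructured Program for Interfacial Dynamics) code, which performs high-fidelity thermal-hydraulic analysis of major pressurized water reactor components. Parallel performance was compared by applying three representative parallel approaches (MPI, OpenMP, and hybrid) to the parallel solution of an incompressible backward-facing step flow problem on meshes of various resolutions. Because parallel performance depends not only on problem size but also on cache memory size, the gain from parallelization was found to be low when the total amount of computation is very small or when the memory used by each thread is much larger than the cache. In addition, MPI outperformed OpenMP regardless of problem size, and the hybrid approach performed best when a relatively small number of threads was used.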

Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment

  • Choi, HyeonSeong;Kim, Youngrang;Lee, Jaehwan;Kim, Yoonhee
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 15, No. 3 / pp.911-931 / 2021
  • Recently, most cloud services have come to use Docker container environments to provide their services. However, no prior research has evaluated the performance of communication libraries for multi-GPU based distributed deep learning in a Docker container environment. In this paper, we propose an efficient communication architecture for multi-GPU based deep learning in a Docker container environment by evaluating the performance of various communication libraries. We compare the performance of the parameter server architecture and the All-reduce architecture, which are typical distributed deep learning architectures. Further, we analyze the performance of two multi-GPU resource allocation policies: allocating a single GPU to each Docker container and allocating multiple GPUs to each Docker container. We also examine the scalability of collective communication by increasing the number of GPUs from one to four. Through experiments, we compare OpenMPI and MPICH, which are representative open source MPI libraries, and NCCL, which is NVIDIA's collective communication library for multi-GPU settings. In the parameter server architecture, we show that using CUDA-aware OpenMPI with multiple GPUs per Docker container reduces communication latency by up to 75%. We also show that using NCCL in the All-reduce architecture reduces communication latency by up to 93% compared to the other libraries.
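
The difference between the two architectures named in this abstract can be illustrated with a small single-process simulation. The sketch below is not taken from the paper: the function names `parameter_server_sum` and `ring_allreduce` are hypothetical, the workers are plain Python lists rather than GPUs, and a ring schedule is only one common way to realize an all-reduce. It shows that both approaches leave every worker with the same summed gradient, while the ring version only ever passes chunks between neighbours.

```python
def parameter_server_sum(buffers):
    """Central server sums all worker buffers and sends the result back to each."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


def ring_allreduce(buffers):
    """Sum equal-length lists across simulated workers using a ring schedule."""
    n = len(buffers)
    size = len(buffers[0])
    data = [list(b) for b in buffers]
    # Split each buffer into n contiguous chunks (some may be empty).
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the
    # complete sum of chunk (r + 1) % n.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            lo, hi = bounds[(rank - step) % n]
            sends.append((lo, data[rank][lo:hi]))  # snapshot before any update
        for rank, (lo, payload) in enumerate(sends):
            dst = (rank + 1) % n
            for k, value in enumerate(payload):
                data[dst][lo + k] += value

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            lo, hi = bounds[(rank + 1 - step) % n]
            sends.append((lo, hi, data[rank][lo:hi]))
        for rank, (lo, hi, payload) in enumerate(sends):
            data[(rank + 1) % n][lo:hi] = payload
    return data


# Three simulated workers, each holding a four-element "gradient".
grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
ring_result = ring_allreduce(grads)
ps_result = parameter_server_sum(grads)
```

In a real deployment the per-chunk transfers would be handled by a library such as NCCL or an MPI implementation; the simulation only mirrors the data movement pattern.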

Hybrid Parallelization for High Performance of CFD_NIMR Model

  • 김민욱;최영진;김영태
    • 대기 / Vol. 22, No. 1 / pp.109-115 / 2012
  • We parallelized CFD_NIMR, a numerical meteorological model, for best performance on both distributed and shared memory parallel computers. The hybrid parallelization uses MPI (Message Passing Interface) to split the 3-dimensional computing domain into horizontal 2-dimensional sub-domains for distributed memory systems, and uses OpenMP (Open Multi-Processing) to split each sub-domain vertically in 1 dimension to take advantage of the shared memory structure. We validated the parallel model against the original sequential model, and the parallel CFD_NIMR model shows efficient speedup on the distributed and shared memory system.
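
The decomposition described in this abstract (horizontal 2-D blocks per process, vertical 1-D slabs per thread) can be sketched as index arithmetic. This is an illustrative assumption, not the CFD_NIMR code: the helper names `block_range` and `hybrid_layout`, the row-major rank ordering, and the grid sizes are all hypothetical, and no actual MPI or OpenMP calls are made.

```python
def block_range(n, parts, index):
    """Return the half-open [start, stop) range of n items owned by block `index`."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def hybrid_layout(nx, ny, nz, px, py, threads):
    """Map each (rank, thread) pair to its sub-block of an nx * ny * nz grid."""
    layout = {}
    for rank in range(px * py):
        ix, iy = divmod(rank, py)  # position in the px-by-py process grid
        x_range = block_range(nx, px, ix)  # horizontal split across processes
        y_range = block_range(ny, py, iy)
        for t in range(threads):
            z_range = block_range(nz, threads, t)  # vertical split across threads
            layout[(rank, t)] = (x_range, y_range, z_range)
    return layout


def cell_count(block):
    """Number of grid cells inside one (x_range, y_range, z_range) block."""
    cells = 1
    for start, stop in block:
        cells *= stop - start
    return cells


# Hypothetical 10 x 8 x 6 grid on a 2 x 2 process grid with 3 threads each.
layout = hybrid_layout(nx=10, ny=8, nz=6, px=2, py=2, threads=3)
total_cells = sum(cell_count(block) for block in layout.values())
```

Each process keeps full vertical columns, so only horizontal neighbours would need to exchange boundary data, while threads inside a process share memory and need no messages.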

Performance Optimization of Parallel Algorithms

  • Hudik, Martin;Hodon, Michal
    • Journal of Communications and Networks / Vol. 16, No. 4 / pp.436-446 / 2014
  • The high intensity of research and modeling in mathematics, physics, biology and chemistry requires new computing resources. Because of the high computational complexity of such tasks, computing time is long and costly. The most efficient way to increase efficiency is to adopt parallel principles. The purpose of this paper is to present the issue of parallel computing with emphasis on the analysis of parallel systems and the impact of communication delays on their efficiency and on overall execution time. The paper focuses on finite algorithms for solving systems of linear equations, namely matrix manipulation (Gauss elimination method, GEM). Algorithms are designed for architectures with shared memory (open multiprocessing, openMP), distributed memory (message passing interface, MPI) and their combination (MPI + openMP). The properties of the algorithms were determined analytically and verified experimentally. Conclusions are drawn for theory and practice.
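
For readers unfamiliar with the Gauss elimination method named in this abstract, a minimal serial sketch follows. It is a textbook version with partial pivoting, not the authors' parallel algorithm; the function name `gauss_solve` and the example system are assumptions made for illustration. The comment marks the row-update loop, whose iterations are independent of each other and are therefore the usual candidate for distribution across threads or processes.

```python
def gauss_solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Build the augmented matrix [a | b] as floats.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Pick the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]

        # Row updates below the pivot do not depend on one another,
        # so this loop is the part a parallel version would split up.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


# Hypothetical 3 x 3 example system.
solution = gauss_solve([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])
```

After each elimination step the pivot row has to be made available to every worker before the next step can start, which is where communication delays of the kind the abstract mentions would enter a distributed version.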

Comparison of Parallel Computation Performances for 3D Wave Propagation Modeling using a Xeon Phi x200 Processor

  • 이종우;하완수
    • 지구물리와물리탐사 / Vol. 21, No. 4 / pp.213-219 / 2018
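  • This study performs 3D wave propagation modeling on a Xeon Phi x200 processor and compares its parallel computing performance with that of a conventional Xeon CPU. Unlike the first-generation Xeon Phi, the Knights Corner coprocessor, the second-generation Xeon Phi x200 processor can run an operating system directly, so no additional communication between on-board memory and main memory is needed. The Xeon Phi x200 processor can also run large-scale computations independently by using its large main memory and high-bandwidth memory. To compare parallel computing performance, the modeling was carried out with MPI (Message Passing Interface) and OpenMP (Open Multi-Processing). In numerical experiments with the SEG/EAGE salt dome model, the Xeon Phi, with its many compute cores and high-bandwidth memory, achieved modeling performance 2.69 to 3.24 times better than a 12-core CPU.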

Parallel Computing Environment for R with on Supercomputer Systems

  • 이상열;원중호
    • 한국경영과학회지 / Vol. 39, No. 4 / pp.19-31 / 2014
  • We study parallel processing techniques for the R programming language using high performance computing technology. In this study, we used a massively parallel computing system with 25,408 CPU cores. We evaluated the performance of a distributed memory system using MPI and of a shared memory system using OpenMP. Our findings are summarized as follows. First, for some particular algorithms, parallel processing is about 150 times faster than serial processing in R. Second, the distributed memory system gets faster as the number of nodes increases, while the shared memory system is limited in its performance improvement by the number of CPUs in a single system.

Numerical discrepancy between serial and MPI parallel computations

  • Lee, Sang Bong
    • International Journal of Naval Architecture and Ocean Engineering / Vol. 8, No. 5 / pp.434-441 / 2016
  • Numerical simulations of the 1D Burgers equation and a 2D sloshing problem were carried out to study the numerical discrepancy between serial and parallel computations. The numerical domain was decomposed into 2 and 4 subdomains for parallel computations with the message passing interface. The numerical solution of the Burgers equation showed that the fully explicit boundary conditions used on the subdomains in parallel computation were responsible for the discrepancy in the transient solution between serial and parallel computations. Two-dimensional sloshing problems in a rectangular domain were solved using OpenFOAM. After the initial transient period, the sloshing patterns of the water differed significantly between serial and parallel computations even though the same numerical conditions were given. Histograms of the pressure measured at two points near the wall showed that the statistical characteristics of the numerical solution were less affected by the number of subdomains than the transient solution was.

Simulated of flow in a three-dimensional porous structure by using the IB-SEM system

  • Wang, Jing;Li, Shucai;Li, Liping;Song, Shuguang;Lin, Peng;Ba, Xingzhi
    • Geomechanics and Engineering / Vol. 18, No. 6 / pp.651-659 / 2019
  • The IB-SEM numerical method combines the spectral/hp element method and the rigid immersed boundary method. This method avoids the problems of low computational efficiency and errors that are caused by the re-division of the grid when the solids move. Based on the Fourier transformation and the 3D immersed boundary method, the 3D IB-SEM system was established. Then, using the open MPI and the Hamilton HPC service, the computational efficiency was increased substantially. The flows around a cylinder and a sphere were simulated by the system. The surface of the cylinder generates vortices with alternating shedding, and these vortices result in a periodic force acting on the surface of the cylinder. When the shedding vortices enter the flow field behind the cylinder, a recirculation zone is formed. Finally, the three-dimensional pore flow was successfully investigated.

OpenMP Implementation using POSIX thread library on ARM11MPCore

  • 이재원;전우철;하순회
    • 한국정보과학회:학술대회논문집 / 한국정보과학회 2007년도 가을 학술발표논문집 Vol.34 No.2 (B) / pp.414-418 / 2007
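  • In multiprocessor environments, OpenMP has the advantage of making parallel programming easier than MPI, and it is recognized as a de facto standard in a parallel programming world that lacks a formal one. Because an OpenMP implementation has to differ depending on the target platform, a matching OpenMP implementation must be created whenever a new processor appears. This paper builds an OpenMP environment based on POSIX threads on the ARM11MPCore, a multiprocessor system-on-chip, and measures its performance.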
