Performance Analysis of the Parallel CUPID Code for Various Parallel Programming Models in Symmetric Multi-Processing System (Symmetric Multi-Processing 시스템에서 다양한 병렬 기법 모델을 적용한 병렬 CUPID 코드의 성능분석)

  • Jeon, Byoung Jin;Lee, Jae Ryong;Yoon, Han Young;Choi, Hyoung Gwon
    • Transactions of the Korean Society of Mechanical Engineers B / v.38 no.1 / pp.71-79 / 2014
  • The parallelization of the bi-conjugate gradient solver for the pressure equation of the CUPID (component unstructured program for interfacial dynamics) code, which was developed for analyzing the components of a pressurized water-cooled reactor, was studied on a symmetric multi-processing system. Parallel performance was investigated for three typical parallel programming models (MPI, OpenMP, Hybrid) by solving incompressible backward-facing step flow at various grid resolutions. Parallel performance was low when the problem size was small or when the memory requirement of each thread considerably exceeded the cache size. Furthermore, MPI outperformed OpenMP regardless of the problem size, and Hybrid performed best when the number of threads was relatively small.

Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment

  • Choi, HyeonSeong;Kim, Youngrang;Lee, Jaehwan;Kim, Yoonhee
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.3 / pp.911-931 / 2021
  • Most cloud services now use Docker container environments to deliver their services. However, no studies have evaluated the performance of communication libraries for multi-GPU based distributed deep learning in a Docker container environment. In this paper, we propose an efficient communication architecture for multi-GPU based deep learning in a Docker container environment by evaluating the performance of various communication libraries. We compare the parameter server architecture and the All-reduce architecture, which are typical distributed deep learning architectures. Further, we analyze two multi-GPU resource allocation policies: allocating a single GPU to each Docker container and allocating multiple GPUs to each Docker container. We also examine the scalability of collective communication by increasing the number of GPUs from one to four. Through experiments, we compare OpenMPI and MPICH, which are representative open source MPI libraries, and NCCL, which is NVIDIA's collective communication library for the multi-GPU setting. In the parameter server architecture, we show that using CUDA-aware OpenMPI with multiple GPUs per Docker container reduces communication latency by up to 75%. We also show that using NCCL in the All-reduce architecture reduces communication latency by up to 93% compared to the other libraries.
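
As a rough illustration of what an All-reduce collective computes, the following sketch simulates a ring all-reduce (sum) inside a single Python process. The abstract does not say which all-reduce algorithm the evaluated libraries use; the ring scheme is one common choice, picked here for illustration only. The function name `ring_allreduce` is hypothetical, and no MPI or NCCL calls are involved.

```python
def ring_allreduce(vectors):
    """Simulate a ring all-reduce (sum) over equally long vectors.

    vectors[i] is the local data of worker i; returns each worker's result.
    """
    n = len(vectors)
    length = len(vectors[0])
    data = [list(v) for v in vectors]
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]

    def chunk(c):
        return range(bounds[c % n], bounds[c % n + 1])

    # Phase 1, reduce-scatter: after n - 1 steps worker i holds the
    # complete sum for chunk (i + 1) mod n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        sends = [[data[i][k] for k in chunk(i - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, value in zip(chunk(i - step), sends[i]):
                data[dst][k] += value

    # Phase 2, all-gather: completed chunks travel once around the ring.
    for step in range(n - 1):
        sends = [[data[i][k] for k in chunk(i + 1 - step)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, value in zip(chunk(i + 1 - step), sends[i]):
                data[dst][k] = value
    return data


if __name__ == "__main__":
    # Three hypothetical workers, each holding a local gradient vector.
    print(ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]))
```

Each worker sends and receives only one chunk per step, so per-worker traffic stays roughly constant as workers are added. A parameter server, by contrast, funnels every worker's full vector through one node.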

Hybrid Parallelization for High Performance of CFD_NIMR Model (기상 모델 CFD_NIMR의 최적 성능을 위한 혼합형 병렬 프로그램 구현)

  • Kim, Min-Wook;Choi, Young-Jean;Kim, Young-Tae
    • Atmosphere / v.22 no.1 / pp.109-115 / 2012
  • We parallelized the CFD_NIMR model, a numerical meteorological model, for best performance on both distributed and shared memory parallel computers. This hybrid parallelization uses MPI (Message Passing Interface) to decompose the 3-dimensional computing domain into horizontal 2-dimensional sub-domains for distributed memory systems, and uses OpenMP (Open Multi-Processing) to decompose the domain vertically into 1-dimensional sub-domains to take advantage of the shared memory structure. We validated the parallel model against the original sequential model, and the parallel CFD_NIMR model shows efficient speedup on distributed and shared memory systems.
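
The horizontal decomposition described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical `px` by `py` process grid with row-major rank numbering and a near-even block split. The function names and grid sizes are illustrative and are not taken from the CFD_NIMR code. In a hybrid scheme, each rank would own one such horizontal block (the MPI part) and share the loop over vertical levels among threads (the OpenMP part).

```python
def block_range(n, parts, index):
    """Return the half-open range (start, stop) of block `index` when
    n cells are split into `parts` nearly equal contiguous blocks."""
    base, extra = divmod(n, parts)
    # The first `extra` blocks receive one additional cell.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def horizontal_subdomain(nx, ny, px, py, rank):
    """Map a rank to its (x, y) block in a px-by-py process grid.

    Ranks are numbered row-major; this layout is an assumption.
    """
    ix, iy = divmod(rank, py)
    return block_range(nx, px, ix), block_range(ny, py, iy)


if __name__ == "__main__":
    # Hypothetical 10 x 7 horizontal grid split over 2 x 3 ranks.
    for rank in range(6):
        print(rank, horizontal_subdomain(10, 7, 2, 3, rank))
```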

Performance Optimization of Parallel Algorithms

  • Hudik, Martin;Hodon, Michal
    • Journal of Communications and Networks / v.16 no.4 / pp.436-446 / 2014
  • The high intensity of research and modeling in mathematics, physics, biology and chemistry requires new computing resources. Because such tasks have high computational complexity, their computing time is long and costly. The most effective way to increase efficiency is to adopt parallel principles. The purpose of this paper is to present the issue of parallel computing, with emphasis on the analysis of parallel systems and on the impact of communication delays on their efficiency and overall execution time. The paper focuses on finite algorithms for solving systems of linear equations, namely matrix manipulation (Gauss elimination method, GEM). Algorithms are designed for architectures with shared memory (open multiprocessing, OpenMP), with distributed memory (message passing interface, MPI), and for their combination (MPI + OpenMP). The properties of the algorithms were determined analytically and verified experimentally. Conclusions are drawn for theory and practice.
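
For orientation, a serial sketch of Gaussian elimination is given below. It is not the authors' algorithm. Partial pivoting is added here for numerical safety and is not mentioned in the abstract, and the function name `gauss_solve` is hypothetical. The comment in the elimination loop marks where the row updates are mutually independent, which is the part a shared-memory or distributed-memory version would typically divide among workers.

```python
def gauss_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    `a` is a list of rows and `b` a list; both are left unmodified.
    """
    n = len(a)
    # Work on an augmented copy [a | b].
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k up.
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[pivot] = m[pivot], m[k]
        # Each row below the pivot is updated independently of the others;
        # this loop is the natural target for a parallel version.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


if __name__ == "__main__":
    print(gauss_solve([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3]))
```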

Comparison of Parallel Computation Performances for 3D Wave Propagation Modeling using a Xeon Phi x200 Processor (제온 파이 x200 프로세서를 이용한 3차원 음향 파동 전파 모델링 병렬 연산 성능 비교)

  • Lee, Jongwoo;Ha, Wansoo
    • Geophysics and Geophysical Exploration / v.21 no.4 / pp.213-219 / 2018
  • In this study, we performed 3D wave propagation modeling on a Xeon Phi x200 processor and compared its parallel computation performance with that of a Xeon CPU. Unlike the 1st generation Xeon Phi coprocessor, codenamed Knights Corner, the 2nd generation Xeon Phi x200 processor requires no additional communication between the internal memory and the main memory, since it can run an operating system directly. The Xeon Phi x200 processor can run large-scale computations independently, using its large main memory and high-bandwidth memory. To compare parallel computation, we performed the modeling using MPI (Message Passing Interface) and OpenMP (Open Multi-Processing). Numerical examples using the SEG/EAGE salt model showed that modeling on the Xeon Phi, with its large number of computational cores and high-bandwidth memory, ran 2.69 to 3.24 times faster than on the 12-core CPU.

Parallel Computing Environment for R with on Supercomputer Systems (빅데이터 분석을 위한 슈퍼컴퓨터 환경에서 R의 병렬처리)

  • Lee, Sang Yeol;Won, Joong Ho
    • Journal of the Korean Operations Research and Management Science Society / v.39 no.4 / pp.19-31 / 2014
  • We study parallel processing techniques for the R programming language in high performance computing. In this study, we used a massively parallel computing system with 25,408 CPU cores. We evaluated the performance of a distributed memory system using MPI and of a shared memory system using OpenMP. Our findings are summarized as follows. First, for some particular algorithms, parallel processing is about 150 times faster than serial processing in R. Second, the distributed memory system gets faster as the number of nodes increases, while the performance improvement of the shared memory system is limited by the number of CPUs in a single system.

Numerical discrepancy between serial and MPI parallel computations

  • Lee, Sang Bong
    • International Journal of Naval Architecture and Ocean Engineering / v.8 no.5 / pp.434-441 / 2016
  • Numerical simulations of the 1D Burgers equation and a 2D sloshing problem were carried out to study the numerical discrepancy between serial and parallel computations. The numerical domain was decomposed into 2 and 4 subdomains for parallel computations with the message passing interface. The numerical solution of the Burgers equation revealed that the fully explicit boundary conditions used on the subdomains of the parallel computation were responsible for the discrepancy in the transient solution between serial and parallel computations. Two-dimensional sloshing problems in a rectangular domain were solved using OpenFOAM. After the initial transient, the sloshing patterns of the water differed significantly between serial and parallel computations, although the same numerical conditions were given. Histograms of pressure measured at two points near the wall showed that the statistical characteristics of the numerical solution were far less affected by the number of subdomains than the transient solution was.

Simulated of flow in a three-dimensional porous structure by using the IB-SEM system

  • Wang, Jing;Li, Shucai;Li, Liping;Song, Shuguang;Lin, Peng;Ba, Xingzhi
    • Geomechanics and Engineering / v.18 no.6 / pp.651-659 / 2019
  • The IB-SEM numerical method combines the spectral/hp element method and the rigid immersed boundary method. This method avoids the low computational efficiency and the errors caused by re-dividing the grid when the solids move. Based on the Fourier transformation and the 3D immersed boundary method, the 3D IB-SEM system was established. Using Open MPI on the Hamilton HPC service then increased the computational efficiency substantially. The flows around a cylinder and a sphere were simulated by the system. The surface of the cylinder generates alternately shedding vortices, and these vortices result in a periodic force acting on the surface of the cylinder. When the shed vortices enter the flow field behind the cylinder, a recirculation zone is formed. Finally, the three-dimensional pore flow was successfully investigated.

OpenMP Implementation using POSIX thread library on ARM11MPCore (ARM11MPCore에서 POSIX 쓰레드를 이용한 OpenMP 구현)

  • Lee, Jae-Won;Jeun, Woo-Chul;Ha, Soon-Hoi
    • Proceedings of the Korean Information Science Society Conference / 2007.10b / pp.414-418 / 2007
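  • In multiprocessor environments, OpenMP has the advantage over MPI of making parallel programming easier, and it is recognized as a de facto standard in the parallel programming world, which lacks a formal standard. Because an OpenMP implementation must differ depending on the target platform, a matching OpenMP implementation has to be built whenever a new processor appears. In this paper, we build an OpenMP environment based on POSIX threads on the ARM11MPCore, a multiprocessor system-on-chip, and measure its performance.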
