• Title/Summary/Keyword: parallel library

Comparative Analysis of Container for High Performance Computing

  • Lee, Jaeryun;Chae, Yunchang;Tak, Byungchul
    • Journal of the Korea Society of Computer and Information, v.25 no.9, pp.11-20, 2020
  • In this paper, we explore the possibility of using containers in the HPC ecosystem and propose criteria for selecting a proper PMI library. Although demand for containers has been growing rapidly in the HPC ecosystem, Docker, the most widely used container, has potential security problems and is not suitable for HPC. Several HPC containers have therefore appeared to solve this problem, which raises the possibility of performance differences between them. For this reason, we measured the performance difference between each HPC container and the Docker container using the NAS Parallel Benchmark, and checked the effect of the type of PMI library. The HPC containers and the Docker container showed almost the same performance as native execution, and in some cases slightly better performance was observed. The comparison between PMI libraries showed that PMIx was not superior to PMI-2 under all conditions.

Implementation and Performance Evaluation of Socket and RMI based Java Message Passing Systems (소켓 및 RMI 기반 자바 메시지 전달 시스템의 구현 및 성능평가)

  • Bang, Seung-Jun;Ahn, Jin-Ho
    • Journal of Internet Computing and Services, v.8 no.5, pp.11-20, 2007
  • This paper designs and implements a message passing library called JMPI (Java Message Passing Interface), which complies with MPJ (Message Passing in Java), the MPI standard specification for the Java language. The library provides graphical user interface tools that let administrators configure parallel computing environments very simply and let users execute JMPI applications conveniently. We also implement two versions of the system, using sockets and RMI, which are both typical distributed system communication mechanisms, and with three benchmark applications compare their performance with that of an existing system, JPVM, as the number of computers increases. Experimental results show that our systems outperform JPVM in several respects, and that the best speedup is obtained when the number of computers is increased with network traffic taken into account. Finally, we find that, as the number of computers increases, using RMI to transmit a message is more effective than using object streams attached to sockets.

A Parallel I/O System on Workstation Clustering Environment for Irregular Applications (비정형 응용을 위한 워크스테이션 클러스터링 환경에서의 병렬 입출력 시스템)

  • No, Jae-Chun;Park, Sung-Soon;Choudhary, Alok
    • Journal of KIISE:Computer Systems and Theory, v.27 no.5, pp.496-505, 2000
  • Clusters of workstations (COW) are becoming an attractive option for parallel scientific computing, a field formerly reserved to the MPPs, because their cost-performance ratio is usually better than that of comparable MPPs, and their hardware and software can be easily enhanced to the latest generations. In this paper we present the design and implementation of our runtime library for clusters of workstations, called "Collective I/O Clustering". The library provides a friendly programming model for the I/O of irregular applications on clusters of workstations, being completely integrated with the underlying communication and I/O system. In the collective I/O clustering, two I/O configurations are possible. In the first I/O configuration, all processors allocated can act as I/O servers as well as compute nodes. In the second I/O configuration, only a subset of processors can act as I/O servers. The compression and software caching facilities have been incorporated into the collective I/O clustering to optimize the communication and I/O costs. All the performance results were obtained on the IBM-SP machine, located at Argonne National Laboratory.

Excavation of 3-amino-2-benzylimino-1,3-thiazolines, Selective Fungicide against Phytophthora infestans and Magnaporthe grisea (토마토 역병균과 벼 도열병균에 선택적인 살균활성의 3-아미노-2-벤질이미노-1,3-티아졸린 유도체 발굴)

  • Hahn, Hoh-Gyu;Nam, Kee-Dal;Shin, Dong-Yoon;Choi, Gyung-Ja;Cho, Kwang-Yun
    • The Korean Journal of Pesticide Science, v.10 no.3, pp.165-171, 2006
  • A new 3-amino-1,3-thiazoline chemical library was synthesized through parallel synthetic technology, and the in vivo antifungal activity of the compounds was investigated against six typical plant diseases (100 ppm). The characteristic feature of these derivatives is that the 2-imino-1,3-thiazoline scaffold carries both a benzyl moiety on the C-2 imino group and an amino group at C-3. Some compounds showed selective antifungal activity against tomato late blight and rice blast. The fungitoxicity is attributed to the 3,4-dichlorophenyl moiety of the benzyl group.

Optimization of Warp-wide CUDA Implementation for Parallel Shifted Sort Algorithm (병렬 Shifted Sort 알고리즘의 Warp 단위 CUDA 구현 최적화)

  • Park, Taejung
    • Journal of Digital Contents Society, v.18 no.4, pp.739-745, 2017
  • This paper presents and discusses an implementation of the GPU shifted sorting method for finding approximate k nearest neighbors that executes within a "warp", the minimum execution unit in the GPU parallel architecture. It also compares the results with two other common nearest neighbor search methods, a GPU-based kd-tree and the ANN (Approximate Nearest Neighbor) library. The proposed implementation focuses on cases where k is small, i.e. 2, 4, 8, and 16, which can be handled efficiently within a warp, since it is very common for applications to use small values of k. The paper also discusses ways to optimize the implementation by improving memory management in a loop for the open-source CUB library and by adopting CUDA commands that are supported by GPU hardware. The proposed implementation shows more than a 16-fold speed-up over the other GPU-based methods in the tests, suggesting that the improvement would be even greater for larger input data.

Implementation of a Parallel Viterbi Decoder for High Speed Multimedia Communications (멀티미디어 통신용 병렬 아키텍쳐 고속 비터비 복호기 설계)

  • Lee, Byeong-Cheol
    • Journal of the Institute of Electronics Engineers of Korea SD, v.37 no.2, pp.78-84, 2000
  • Viterbi decoders can be classified into serial Viterbi decoders and parallel Viterbi decoders. Parallel Viterbi decoders can handle higher data rates than serial Viterbi decoders. This paper designs and implements a fully parallel Viterbi decoder for high-speed multimedia communications. For high-speed operation, the ACS (Add-Compare-Select) module, consisting of 64 PEs (Processing Elements), can compute one stage per clock cycle. In addition, a systolic array structure with 32 pipeline stages is developed for the TB (traceback) module. The implemented Viterbi decoder supports code rates 1/2, 2/3, 3/4, 5/6 and 7/8 using punctured codes. We have developed Verilog HDL models and performed logic synthesis. The 0.6 $\mu\mathrm{m}$ SAMSUNG KG75000 SOG cell library has been used. The implemented Viterbi decoder has about 100,400 gates and runs at 70 MHz in the worst-case simulation.
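
The add-compare-select stage mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the paper's Verilog design: the function name `acs_step`, the 4-state toy trellis, and the metric values are all assumptions chosen for illustration. A hardware decoder with one PE per state would evaluate every state in the same clock cycle, whereas this sketch simply loops over the states.

```python
def acs_step(path_metrics, branch_metrics, predecessors):
    """One add-compare-select (ACS) stage of a Viterbi decoder.

    path_metrics[s]        accumulated metric of state s before this stage
    predecessors[s]        states that have a trellis edge into state s
    branch_metrics[(p, s)] cost of the edge p -> s for the current symbol
    Returns (new_metrics, survivors), where survivors[s] is the chosen
    predecessor of state s (stored for the later traceback).
    """
    new_metrics = []
    survivors = []
    for s, preds in enumerate(predecessors):
        # Add: extend each incoming path by its branch metric.
        candidates = [(path_metrics[p] + branch_metrics[(p, s)], p) for p in preds]
        # Compare and select: keep the path with the smallest metric.
        best_metric, best_pred = min(candidates)
        new_metrics.append(best_metric)
        survivors.append(best_pred)
    return new_metrics, survivors


# Hypothetical 4-state trellis (next state = ((s << 1) | bit) & 3).
predecessors = [(0, 2), (0, 2), (1, 3), (1, 3)]
path_metrics = [0, 3, 1, 2]
branch_metrics = {
    (0, 0): 2, (2, 0): 0,
    (0, 1): 0, (2, 1): 2,
    (1, 2): 1, (3, 2): 1,
    (1, 3): 0, (3, 3): 3,
}
new_metrics, survivors = acs_step(path_metrics, branch_metrics, predecessors)
```

The survivor list is what a traceback module would later walk backwards to recover the decoded bits.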

A study on the Cost-effective Architecture Design of High-speed Soft-decision Viterbi Decoder for Multi-band OFDM Systems (Multi-band OFDM 시스템용 고속 연판정 비터비 디코더의 효율적인 하드웨어 구조 설계에 관한 연구)

  • Lee, Seong-Joo
    • Journal of the Institute of Electronics Engineers of Korea SD, v.43 no.11 s.353, pp.90-97, 2006
  • In this paper, we present a cost-effective architecture for a high-speed soft-decision Viterbi decoder for Multi-band OFDM (MB-OFDM) systems. In the design of modems for MB-OFDM systems, a parallel processing architecture is generally used for reliable hardware implementation, because the systems must support very high data rates of up to 480 Mbps. The Viterbi decoder must likewise be designed with a parallel processing structure and support a very high data rate. We therefore present an optimized hardware architecture for a 4-way parallel processing Viterbi decoder. To optimize the hardware of the Viterbi decoder, we compare and analyze various ACS architectures and find the optimal one with respect to hardware complexity and operating frequency. The Viterbi decoder with the optimal hardware architecture is designed and verified using Verilog HDL, and synthesized into gate-level circuits with the TSMC 0.13 μm library. The synthesis results show that the Viterbi decoder contains about 280K gates and works properly at the speed required in MB-OFDM systems.

Parallel Architecture Design of H.264/AVC CAVLC for UD Video Realtime Processing (UD(Ultra Definition) 동영상 실시간 처리를 위한 H.264/AVC CAVLC 병렬 아키텍처 설계)

  • Ko, Byung Soo;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics and Information Engineers, v.50 no.5, pp.112-120, 2013
  • In this paper, we propose a high-performance H.264/AVC CAVLC encoder for UD video real-time processing. Statistical values are obtained in one cycle through parallel arithmetic and logical operations, using a non-zero bit stream that indicates whether each coefficient is zero or non-zero. To encode one codeword per cycle, we remove the recursive operation in level encoding through parallel comparison of the coefficient and the escape value. In order to implement a high-speed circuit, the proposed CAVLC encoder is designed as a two-stage {statistical scan, codeword encoding} pipeline. To reduce the encoding table, an arithmetic unit is used to encode non-coefficient and to calculate the codeword. The proposed architecture was simulated in a 0.13 μm standard cell library. The gate count is 33.4K gates. The architecture can support Ultra Definition video ($3840 \times 2160$) at 100 frames per second when running at 100 MHz.

Comparisons of Parallel Preconditioners for the Computation of Interior Eigenvalues by the Minimization of Rayleigh Quotient (레이레이 계수의 최소화에 의한 내부고유치 계산을 위한 병렬준비행렬들의 비교)

  • Ma, Sang-back;Jang, Ho-Jong
    • The KIPS Transactions:PartA, v.10A no.2, pp.137-140, 2003
  • Recently, the CG (Conjugate Gradient) scheme for the optimization of the Rayleigh quotient has proven to be a very attractive and promising technique for computing interior eigenvalues of the eigenvalue problem $Ax = \lambda x$. The given matrix $A$ is assumed to be large, sparse, and symmetric. The method is also very amenable to parallel computation. A proper choice of preconditioner significantly improves the convergence of the CG scheme. We compare parallel preconditioners for the computation of the interior eigenvalues of a symmetric matrix by a CG-type method. The preconditioners considered are Point-SSOR, ILU(0) in the multi-coloring order, and Multi-Color Block SSOR (Symmetric Successive OverRelaxation). We conducted our experiments on the CRAY-T3E with 128 nodes. The MPI (Message Passing Interface) library was adopted for the interprocessor communications. The test matrices are up to $512 \times 512$ in dimension and were created from discretizations of an elliptic PDE. All things considered, MC-BSSOR seems to be the most robust preconditioner.
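
To make the idea of minimizing the Rayleigh quotient concrete, here is a small Python sketch. It uses plain, unpreconditioned steepest descent on a dense 2×2 matrix, so it is a deliberately simplified stand-in for the preconditioned parallel CG scheme described above, and it converges to the smallest eigenvalue rather than to an interior one. The function names, step size, and test matrix are assumptions made for illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product A @ x for lists of lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def rayleigh_quotient(A, x):
    """rho(x) = (x^T A x) / (x^T x)."""
    return dot(x, matvec(A, x)) / dot(x, x)


def rayleigh_gradient(A, x):
    """Gradient of rho for symmetric A: 2 (A x - rho(x) x) / (x^T x)."""
    rho = rayleigh_quotient(A, x)
    Ax = matvec(A, x)
    scale = 2.0 / dot(x, x)
    return [scale * (ax - rho * xi) for ax, xi in zip(Ax, x)]


def minimize_rayleigh(A, x0, step=0.1, iters=500):
    """Fixed-step steepest descent on rho, renormalizing x every iteration."""
    x = list(x0)
    for _ in range(iters):
        g = rayleigh_gradient(A, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        norm = dot(x, x) ** 0.5
        x = [xi / norm for xi in x]
    return rayleigh_quotient(A, x), x


# Symmetric test matrix with eigenvalues 1 and 3.
A = [[2.0, 1.0], [1.0, 2.0]]
lam, vec = minimize_rayleigh(A, [1.0, 0.0])
```

At a stationary point the gradient vanishes, which means $Ax = \rho(x)\,x$, so the minimizer is an eigenvector and the minimum value is the corresponding eigenvalue. A preconditioner changes the search direction to speed up this convergence.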

Parallel Computing Strategies for High-Speed Impact into Ceramic/Metal Plates (세라믹/금속판재의 고속충돌 파괴 유한요소 병렬 해석기법)

  • Moon, Ji-Joong;Kim, Seung-Jo;Lee, Min-Hyung
    • Journal of the Computational Structural Engineering Institute of Korea, v.22 no.6, pp.527-532, 2009
  • This paper discusses simulations of impact into ceramic and/or metal materials. To model the discrete nature of fracture and damage in brittle materials, we implemented a cohesive-law fracture model with a node separation algorithm for tensile failure and a Mohr-Coulomb model for compressive loading. The drawback of this scheme is that it requires heavy computation time, because new nodes are generated continuously whenever a new crack surface is created. To reduce the amount of calculation, parallelization with the MPI library has been implemented. In high-speed impact problems, the mesh configuration and contact calculation change continuously as the time step advances, which unbalances the computational load across processors. A dynamic load balancing technique that re-allocates the load dynamically is used to achieve good parallel performance. Several impact problems have been simulated, and the parallel performance and accuracy of the solutions are discussed.