• Title/Summary/Keyword: Massively parallel execution

Design and Implementation of a Massively Parallel Multithreaded Architecture: DAVRID

  • Ha, Sangho; Kim, Junghwan; Park, Eunha; Hah, Yoonhee; Han, Sangyong; Hwang, Daejoon; Kim, Heunghwan; Cho, Seungho
    • Journal of Electrical Engineering and Information Science / v.1 no.2 / pp.15-26 / 1996
  • MPAs (Massively Parallel Architectures) must address two fundamental issues to achieve scalability: synchronization and communication latency. Dataflow architectures offer the ability to exploit the massive parallelism inherent in programs, but they suffer from excessive synchronization overhead and inefficient execution of sequential code. In contrast, MPAs based on the von Neumann computational model may suffer from inefficient synchronization mechanisms and communication latency. DAVRID (DAtaflow/Von Neumann RISC hybrID) is a massively parallel multithreaded architecture that combines the advantages of the von Neumann and dataflow models: it provides good single-thread performance while tolerating synchronization and communication latency. In this paper, we describe the DAVRID architecture in detail and evaluate its performance through simulation runs over several benchmarks.

Development of a drift-flux model based core thermal-hydraulics code for efficient high-fidelity multiphysics calculation

  • Lee, Jaejin; Facchini, Alberto; Joo, Han Gyu
    • Nuclear Engineering and Technology / v.51 no.6 / pp.1487-1503 / 2019
  • The methods and performance of ESCOT, a pin-level nuclear reactor core thermal-hydraulics (T/H) code employing the drift-flux model, are presented. This code aims at providing an accurate yet fast core thermal-hydraulics solution capability to high-fidelity multiphysics core analysis systems targeting massively parallel computing platforms. The four-equation drift-flux model is adopted for two-phase calculations, and numerical solutions are obtained by applying the Finite Volume Method (FVM) and a Semi-Implicit Method for Pressure-Linked Equations (SIMPLE)-like algorithm on a staggered grid system. Constitutive models involving turbulent mixing, pressure drop, and vapor generation are employed to simulate key phenomena in subchannel-scale analyses. ESCOT is parallelized by a domain decomposition scheme that involves both radial and axial decomposition to enable highly parallelized execution. The ESCOT solutions are validated against various experiments, including CNEN 4×4, Weiss et al. two assemblies, PNNL 2×6, RPI 2×2 air-water, and PSBT, covering single/two-phase and unheated/heated conditions. The parameters of interest for validation include various flow characteristics such as turbulent mixing, spacer grid pressure drop, cross-flow, reverse flow, buoyancy effect, void drift, and bubble generation. For all the validation tests, ESCOT shows good agreement with measured data, to an extent comparable to that of other subchannel-scale codes: COBRA-TF, MATRA and/or CUPID. The execution performance is examined with a mini-sized whole core consisting of 89 fuel assemblies and with an OPR1000 core. ESCOT turns out to be about 1.5 times faster than a subchannel code based on the two-fluid, three-field model, and the axial domain decomposition scheme works as well as the radial one, yielding a steady-state solution for the OPR1000 core within 30 s on 104 processors.
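
For reference, the drift-flux formulation mentioned in this abstract is built around the standard (Zuber-Findlay) drift-flux relation; the form below is the generic textbook relation, not necessarily the exact closure implemented in ESCOT:

      \frac{\langle j_g \rangle}{\langle \alpha \rangle} \;=\; C_0\,\langle j \rangle + \bar{V}_{gj}

where \alpha is the void fraction, j_g and j are the gas and total volumetric fluxes, C_0 is the distribution parameter, and \bar{V}_{gj} is the mean drift velocity. A four-equation model of this kind typically closes mixture mass, momentum, and energy equations plus a gas mass equation with a correlation of this type.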

Modeling, Discovering, and Visualizing Workflow Performer-Role Affiliation Networking Knowledge

  • Kim, Haksung; Ahn, Hyun; Kim, Kwanghoon Pio
    • KSII Transactions on Internet and Information Systems (TIIS) / v.8 no.2 / pp.691-708 / 2014
  • This paper formalizes a special type of social networking knowledge, called "workflow performer-role affiliation networking knowledge." A workflow model specifies the execution sequences of the associated activities and their affiliated relationships with roles, performers, invoked applications, and relevant data. In particular, these affiliated relationships embody a stream of organizational work-sharing knowledge and provide business process intelligence for exploring the resource-allotment and planning knowledge concealed in the corresponding workflow model. In this paper, we focus on the performer-role affiliation relationships and their implications as organizational and business process intelligence in workflow-driven organizations. We elaborate a series of theoretical formalisms and practical implementations for modeling, discovering, and visualizing workflow performer-role affiliation networking knowledge, covering representation, discovery, and visualization techniques in detail. These theoretical concepts and practical algorithms are based upon the information control net methodology for formally describing workflow models, and the affiliated knowledge represents the various degrees of involvement and participation between a group of performers and a group of roles in a corresponding workflow model. Finally, we summarize the implications of the proposed affiliation networking knowledge as business process intelligence, and its value in discovering and visualizing such knowledge in workflow-driven organizations and enterprises that produce massively parallel interactions and large-scale operational data collections by deploying and enacting massively parallel, large-scale workflow models.
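
As a rough illustration of the affiliation structure described above (a generic bipartite formulation sketched for this summary, not the paper's information control net formalism), a performer-role affiliation network can be written as

      G = (P \cup R, E), \qquad A \in \{0,1\}^{|P| \times |R|}, \qquad A_{ij} = 1 \iff (p_i, r_j) \in E

where P is the set of performers, R the set of roles, and A_{ij} = 1 indicates that performer p_i is affiliated with role r_j in the workflow model; the row and column sums of A then quantify the degrees of involvement and participation the abstract refers to.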

Algorithmic GPGPU Memory Optimization

  • Jang, Byunghyun; Choi, Minsu; Kim, Kyung Ki
    • JSTS: Journal of Semiconductor Technology and Science / v.14 no.4 / pp.391-406 / 2014
  • The performance of General-Purpose computation on Graphics Processing Units (GPGPU) is heavily dependent on memory access behavior. This sensitivity is due to a combination of the underlying Massively Parallel Processing (MPP) execution model present on GPUs and the lack of architectural support for handling irregular memory access patterns. Application performance can be significantly improved by applying memory-access-pattern-aware optimizations that exploit knowledge of the characteristics of each access pattern. In this paper, we present an algorithmic methodology for semi-automatically finding the best mapping of the memory accesses present in a serial loop nest to the underlying data-parallel architecture, based on a comprehensive static memory access pattern analysis. To that end, we present a simple yet powerful mathematical model that captures all memory access pattern information present in serial data-parallel loop nests. We then show how this model is used in practice to select the most appropriate memory space for data and to search for an appropriate thread mapping and work-group size in a large design space. To evaluate the effectiveness of our methodology, we report execution speedups on selected benchmark kernels that cover a wide range of memory access patterns commonly found in GPGPU workloads. Our experimental results are reported using the industry-standard heterogeneous programming language OpenCL, targeting the NVIDIA GT200 architecture.
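
To make the memory-access-pattern sensitivity concrete, the pair of kernels below contrasts a strided and a coalesced thread mapping for the same row-major matrix copy. This is an illustrative CUDA sketch written for this summary (the paper itself targets OpenCL on the GT200), and the kernel names are invented for the example.

      // Copy an N x M row-major matrix with two different thread mappings.
      // Strided mapping: one thread per row, so neighbouring threads in a warp
      // access addresses M elements apart (poorly coalesced).
      __global__ void copy_row_per_thread(const float* in, float* out, int N, int M)
      {
          int row = blockIdx.x * blockDim.x + threadIdx.x;
          if (row < N)
              for (int col = 0; col < M; ++col)
                  out[row * M + col] = in[row * M + col];   // stride of M between neighbouring threads
      }

      // Coalesced mapping: consecutive threads handle consecutive column indices,
      // so a warp touches a contiguous block of memory.
      __global__ void copy_col_per_thread(const float* in, float* out, int N, int M)
      {
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          int row = blockIdx.y;
          if (col < M && row < N)
              out[row * M + col] = in[row * M + col];       // neighbouring threads touch neighbouring addresses
      }

Choosing between such thread mappings, work-group sizes, and memory spaces for each access pattern is essentially the design space that the paper's static analysis searches automatically.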

Fast Data Assimilation using Kernel Tridiagonal Sparse Matrix for Performance Improvement of Air Quality Forecasting (대기질 예보의 성능 향상을 위한 커널 삼중대각 희소행렬을 이용한 고속 자료동화)

  • Bae, Hyo Sik; Yu, Suk Hyun; Kwon, Hee Yong
    • Journal of Korea Multimedia Society / v.20 no.2 / pp.363-370 / 2017
  • Data assimilation is an initialization method for air quality forecasting, such as PM10 forecasting, and is very important for enhancing forecasting accuracy. Optimal interpolation is one of the data assimilation techniques; it is very effective and widely used in air quality forecasting. The technique, however, requires a large amount of memory and a long execution time, which makes real-time PM10 air quality forecasting difficult. We propose a fast optimal interpolation data assimilation method for PM10 air quality forecasting using a new kernel tridiagonal sparse matrix and the CUDA massively parallel processing architecture. Experimental results show the proposed method is 5~56 times faster than conventional ones.
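
For context, the analysis step that optimal interpolation performs is the standard one below (textbook form; the paper's contribution is a fast realization of it using a kernel tridiagonal sparse matrix on CUDA):

      x_a = x_b + K\,(y - H x_b), \qquad K = B H^{\mathsf{T}} \left( H B H^{\mathsf{T}} + R \right)^{-1}

where x_b is the background forecast, y the observations, H the observation operator, B and R the background and observation error covariance matrices, and x_a the analysis used to initialize the next forecast.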

Architecture design and FPGA implementation of a system control unit for a multiprocessor chip (다중 프로세서 칩을 위한 시스템 제어 장치의 구조설계 및 FPGA 구현)

  • 박성모;정갑천
    • Journal of the Korean Institute of Telematics and Electronics C / v.34C no.12 / pp.9-19 / 1997
  • This paper describes the design and FPGA implementation of a system control unit within a multiprocessor chip that can be used as a node processor in a massively parallel processing (MPP) system; the chip comprises integer units, caches, memory management units, a bus unit, and a system control unit. The major functions of the system control unit are locking/unlocking of shared variables for protected access, synchronization of instruction execution among the four integer units, interrupt control, and generation and control of processor status. The system control unit was first modeled at a high level using Verilog HDL, then simulated and verified in an environment to which a trap handler and an external interrupt controller were added. The functional blocks of the system control unit were converted into an RTL (register transfer level) model and synthesized with the Xilinx FPGA cell library using the Synopsys tool. After timing verification, the synthesized system control unit was implemented on a Xilinx FPGA chip (XC4025EPG299).

Parallel Computation of FDTD algorithm using CUDA (CUDA를 이용한 FDTD 알고리즘의 병렬처리)

  • Lee, Ho-Young; Park, Jong-Hyun; Kim, Jun-Seong
    • Journal of the Institute of Electronics Engineers of Korea CI / v.47 no.4 / pp.82-87 / 2010
  • Modern GPUs (Graphics Processing Units) provide computing capability higher than that of general CPUs (Central Processing Units). With support for programmability of the graphics pipeline, GP-GPU (General-Purpose computation on GPU) has gained much attention, expanding its application area. This paper compares sequential and massively parallel implementations of the FDTD (Finite Difference Time Domain) algorithm using CUDA (Compute Unified Device Architecture). Experimental results show up to a 45X speedup over conventional CPU execution.
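
As a minimal sketch of how an FDTD update maps onto CUDA's massively parallel execution model (a generic 1-D Yee update written for illustration, not the authors' implementation), each grid cell is handled by one thread and the two kernels alternate once per time step:

      // 1-D FDTD (Yee) update: ce and ch are the precomputed coefficients
      // dt/(eps*dx) and dt/(mu*dx); hy[i] sits between ez[i] and ez[i+1].
      __global__ void update_H(float* hy, const float* ez, float ch, int nx)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < nx - 1)
              hy[i] += ch * (ez[i + 1] - ez[i]);
      }

      __global__ void update_E(float* ez, const float* hy, float ce, int nx)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i > 0 && i < nx)
              ez[i] += ce * (hy[i] - hy[i - 1]);
      }

      // Host-side time loop (sketch):
      // for (int t = 0; t < nt; ++t) {
      //     update_H<<<(nx + 255) / 256, 256>>>(d_hy, d_ez, ch, nx);
      //     update_E<<<(nx + 255) / 256, 256>>>(d_ez, d_hy, ce, nx);
      // }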

Numerical Simulation of Incompressible Laminar Flow around a Propeller Using the Multigrid Technique (멀티그리드 방법을 이용한 프로펠러 주위의 비압축성 층류유동 계산)

  • W.G. Park
    • Journal of the Society of Naval Architects of Korea / v.31 no.4 / pp.41-50 / 1994
  • An iterative time-marching procedure for solving incompressible viscous flows has been applied to the flow around a propeller. This procedure solves the three-dimensional Navier-Stokes equations on a moving, body-fitted, non-orthogonal grid using a first-order accurate scheme for the time derivatives and second- and third-order accurate schemes for the spatial derivatives. To accelerate the iterative process, a multigrid technique has been applied. The procedure is suitable for efficient execution on the current generation of vector or massively parallel computer architectures. Generally good agreement with published experimental and numerical data has been obtained. It was also found that the multigrid technique was effective in reducing the CPU time needed for the simulation and improved the solution quality.
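
For reference, the coarse-grid correction at the heart of any multigrid acceleration of such an iterative solver can be summarized in generic two-grid form (not the paper's specific cycle) as

      r_h = f_h - A_h u_h, \qquad e_{2h} = A_{2h}^{-1} I_h^{2h} r_h, \qquad u_h \leftarrow u_h + I_{2h}^{h} e_{2h}

where A_h is the fine-grid operator, I_h^{2h} and I_{2h}^{h} are the restriction and prolongation operators, and smoothing sweeps are applied on the fine grid before and after the correction; applying the idea recursively on the coarse grid gives the usual V- or W-cycles.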

A Multithreaded Architecture for the Efficient Execution of Vector Computations (벡타 연산을 효율적으로 수행하기 위한 다중 스레드 구조)

  • Yun, Seong-Dae; Jeong, Gi-Dong
    • The Transactions of the Korea Information Processing Society / v.2 no.6 / pp.974-984 / 1995
  • This paper presents the design of MULVEC (MULtithreaded architecture for VEctor Computations), a high-performance building block for massively parallel processing systems. MULVEC comes from the synthesis of the dataflow model and the extant superscalar RISC microprocessor. Using status fields, MULVEC reduces the number of synchronizations in the case of repeated vector computations within the same thread segment, and it also reduces the amount of context switching, network traffic, etc. After benchmark programs were simulated on a SPARCstation 20 (a superscalar RISC microprocessor), the performance of MULVEC (program execution time and processor utilization) and the performance of *T (program execution time) were analyzed for different numbers of nodes. We observed that program execution on MULVEC is about 1 to 2 times faster than on *T, depending on the number of nodes and the number of loop repetitions.

A Study on High Speed Image Rotation Algorithm using CUDA (CUDA를 이용한 고속 영상 회전 알고리즘에 관한 연구)

  • Kwon, Hee-Choul; Cho, Hyung-Jin; Kwon, Hee-Yong
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.16 no.5 / pp.1-6 / 2016
  • Image rotation is one of the main pre-processing steps in image processing and image pattern recognition. It is usually implemented with a rotation matrix multiplication, which requires many floating-point arithmetic operations and trigonometric function calculations and therefore takes a long execution time. We propose a new high-speed image rotation algorithm that avoids these two major time-consuming operations. It uses just two shear translation operations, so it is very fast. In addition, we apply a parallel computing technique with CUDA, a massively parallel computing architecture based on GPUs, which have recently become prevalent. As the GPU is a dedicated graphics processor, it is excellent for parallel processing of pixels. We compare the proposed algorithm with the conventional rotation algorithm on images of various sizes. Experimental results show that the proposed algorithm is superior to the conventional rotation ones.
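
To illustrate the kind of shear pass the abstract refers to, the kernel below applies a single horizontal shear to a grayscale image with one thread per destination pixel. It is a sketch written for this summary (the paper's exact two-shear decomposition is not spelled out in the abstract), and the shear factor, a function of the rotation angle, is assumed to be computed once on the host so that no per-pixel trigonometry is needed.

      // Horizontal shear: dst(x, y) = src(x - shear * (y - h/2), y).
      // Out-of-range source pixels are filled with 0.
      __global__ void shear_x(const unsigned char* src, unsigned char* dst,
                              int w, int h, float shear)
      {
          int x = blockIdx.x * blockDim.x + threadIdx.x;
          int y = blockIdx.y * blockDim.y + threadIdx.y;
          if (x >= w || y >= h) return;

          int sx = x - __float2int_rn(shear * (y - h / 2));   // shift each row by a row-dependent offset
          dst[y * w + x] = (sx >= 0 && sx < w) ? src[y * w + sx] : 0;
      }

Because each pass only shifts whole rows (or, for a vertical shear, whole columns), the per-pixel work reduces to integer indexing, which is what makes shear-based rotation attractive on a massively parallel GPU.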