• Title/Summary/Keyword: Parallel GPU

Search Result 284, Processing Time 0.026 seconds

GPU-based Monte Carlo Photon Migration Algorithm with Path-partition Load Balancing

  • Jeon, Youngjin;Park, Jongha;Hahn, Joonku;Kim, Hwi
    • Current Optics and Photonics
    • /
    • v.5 no.6
    • /
    • pp.617-626
    • /
    • 2021
  • A parallel Monte Carlo photon migration algorithm for graphics processing units that implements an improved load-balancing strategy is presented. Conventional parallel Monte Carlo photon migration algorithms suffer from a computational bottleneck due to their reliance on a simple load-balancing strategy that does not take into account the different length of the mean free paths of the photons. In this paper, path-partition load balancing is proposed to eliminate this computational bottleneck based on a mathematical formula that parallelizes the photon path tracing process, which has previously been considered non-parallelizable. The performance of the proposed algorithm is tested using three-dimensional photon migration simulations of a human skin model.

GPU-ACCELERATED SPECKLE MASKING RECONSTRUCTION ALGORITHM FOR HIGH-RESOLUTION SOLAR IMAGES

  • Zheng, Yanfang;Li, Xuebao;Tian, Huifeng;Zhang, Qiliang;Su, Chong;Shi, Lingyi;Zhou, Ta
    • Journal of The Korean Astronomical Society
    • /
    • v.51 no.3
    • /
    • pp.65-71
    • /
    • 2018
  • The near real-time speckle masking reconstruction technique has been developed to accelerate the processing of solar images to achieve high resolutions for ground-based solar telescopes. However, the reconstruction of solar subimages in such a speckle reconstruction is very time-consuming. We design and implement a new parallel speckle masking reconstruction algorithm based on the Compute Unified Device Architecture (CUDA) on General Purpose Graphics Processing Units (GPGPU). Tests are performed to validate the correctness of our program on NVIDIA GPGPU. Details of several parallel reconstruction steps are presented, and the parallel implementation between various modules shows a significant speed increase compared to the previous serial implementations. In addition, we present a comparison of runtimes across serial programs, the OpenMP-based method, and the new parallel method. The new parallel method shows a clear advantage for large scale data processing, and a speedup of around 9 to 10 is achieved in reconstructing one solar subimage of $256{\times}256pixels$. The speedup performance of the new parallel method exceeds that of OpenMP-based method overall. We conclude that the new parallel method would be of value, and contribute to real-time reconstruction of an entire solar image.

Parallel Implementations of Digital Focus Indices Based on Minimax Search Using Multi-Core Processors

  • HyungTae, Kim;Duk-Yeon, Lee;Dongwoon, Choi;Jaehyeon, Kang;Dong-Wook, Lee
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.2
    • /
    • pp.542-558
    • /
    • 2023
  • A digital focus index (DFI) is a value used to determine image focus in scientific apparatus and smart devices. Automatic focus (AF) is an iterative and time-consuming procedure; however, its processing time can be reduced using a general processing unit (GPU) and a multi-core processor (MCP). In this study, parallel architectures of a minimax search algorithm (MSA) are applied to two DFIs: range algorithm (RA) and image contrast (CT). The DFIs are based on a histogram; however, the parallel computation of the histogram is conventionally inefficient because of the bank conflict in shared memory. The parallel architectures of RA and CT are constructed using parallel reduction for MSA, which is performed through parallel relative rating of the image pixel pairs and halved the rating in every step. The array size is then decreased to one, and the minimax is determined at the final reduction. Kernels for the architectures are constructed using open source software to make it relatively platform independent. The kernels are tested in a hexa-core PC and an embedded device using Lenna images of various sizes based on the resolutions of industrial cameras. The performance of the kernels for the DFIs was investigated in terms of processing speed and computational acceleration; the maximum acceleration was 32.6× in the best case and the MCP exhibited a higher performance.

Efficient Implementation of Convolutional Neural Network Using CUDA (CUDA를 이용한 Convolutional Neural Network의 효율적인 구현)

  • Ki, Cheol-Min;Cho, Tai-Hoon
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.21 no.6
    • /
    • pp.1143-1148
    • /
    • 2017
  • Currently, Artificial Intelligence and Deep Learning are rising as hot social issues, and these technologies are applied to various fields. A good method among the various algorithms in Artificial Intelligence is Convolutional Neural Networks. Convolutional Neural Network is a form that adds Convolution Layers to Multi Layer Neural Network. If you use Convolutional Neural Networks for small amount of data, or if the structure of layers is not complicated, you don't have to pay attention to speed. But the learning should take long time when the size of the learning data is large and the structure of layers is complicated. In these cases, GPU-based parallel processing is frequently needed. In this paper, we developed Convolutional Neural Networks using CUDA, and show that its learning is faster and more efficient than learning using some other frameworks or programs.

Adaptive Load Balancing Scheme using a Combination of Hierarchical Data Structures and 3D Clustering for Parallel Volume Rendering on GPU Clusters (계층 자료구조의 결합과 3차원 클러스터링을 이용하여 적응적으로 부하 균형된 GPU-클러스터 기반 병렬 볼륨 렌더링)

  • Lee Won-Jong;Park Woo-Chan;Han Tack-Don
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.33 no.1_2
    • /
    • pp.1-14
    • /
    • 2006
  • Sort-last parallel rendering using a cluster of GPUs has been widely used as an efficient method for visualizing large- scale volume datasets. The performance of this method is constrained by load balancing when data parallelism is included. In previous works static partitioning could lead to self-balance when only task level parallelism is included. In this paper, we present a load balancing scheme that adapts to the characteristic of volume dataset when data parallelism is also employed. We effectively combine the hierarchical data structures (octree and BSP tree) in order to skip empty regions and distribute workload to corresponding rendering nodes. Moreover, we also exploit a 3D clustering method to determine visibility order and save the AGP bandwidths on each rendering node. Experimental results show that our scheme can achieve significant performance gains compared with traditional static load distribution schemes.

Parallel LDPC Decoding on a Heterogeneous Platform using OpenCL

  • Hong, Jung-Hyun;Park, Joo-Yul;Chung, Ki-Seok
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.10 no.6
    • /
    • pp.2648-2668
    • /
    • 2016
  • Modern mobile devices are equipped with various accelerated processing units to handle computationally intensive applications; therefore, Open Computing Language (OpenCL) has been proposed to fully take advantage of the computational power in heterogeneous systems. This article introduces a parallel software decoder of Low Density Parity Check (LDPC) codes on an embedded heterogeneous platform using an OpenCL framework. The LDPC code is one of the most popular and strongest error correcting codes for mobile communication systems. Each step of LDPC decoding has different parallelization characteristics. In the proposed LDPC decoder, steps suitable for task-level parallelization are executed on the multi-core central processing unit (CPU), and steps suitable for data-level parallelization are processed by the graphics processing unit (GPU). To improve the performance of OpenCL kernels for LDPC decoding operations, explicit thread scheduling, vectorization, and effective data transfer techniques are applied. The proposed LDPC decoder achieves high performance and high power efficiency by using heterogeneous multi-core processors on a unified computing framework.

Redundant Parallel Hopfield Network Configurations: A New Approach to the Two-Dimensional Face Recognitions (병렬 다중 홉 필드 네트워크 구성으로 인한 2-차원적 얼굴인식 기법에 대한 새로운 제안)

  • Kim, Yong Taek;Deo, Kiatama
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.7 no.2
    • /
    • pp.63-68
    • /
    • 2018
  • Interests in face recognition area have been increasing due to diverse emerging applications. Face recognition algorithm from a two-dimensional source could be challenging in dealing with some circumstances such as face orientation, illuminance degree, face details such as with/without glasses and various expressions, like, smiling or crying. Hopfield Network capabilities have been used specially within the areas of recalling patterns, generalizations, familiarity recognitions and error corrections. Based on those abilities, a specific experimentation is conducted in this paper to apply the Redundant Parallel Hopfield Network on a face recognition problem. This new design has been experimentally confirmed and tested to be robust in any kind of practical situations.

An Analytical Evaluation of 2D Mesh-connected SIMD Architecture for Parallel Matrix Multiplication (2D Mesh SIMD 구조에서의 병렬 행렬 곱셈의 수치적 성능 분석)

  • Kim, Cheong-Ghil
    • Journal of The Institute of Information and Telecommunication Facilities Engineering
    • /
    • v.10 no.1
    • /
    • pp.7-13
    • /
    • 2011
  • Matrix multiplication is a fundamental operation of linear algebra and arises in many areas of science and engineering. This paper introduces an efficient parallel matrix multiplication scheme on N ${\times}$ N mesh-connected SIMD array processor, called multiple hierarchical SIMD architecture (HMSA). The architectural characteristic of HMSA is the hierarchically structured control units which consist of a global control unit, N local control units configured diagonally, and $N^2$ processing elements (PEs) arranged in an N ${\times}$ N array. PEs are communicating through local buses connecting four adjacent neighbor PEs in mesh-torus networks and global buses running across the rows and columns called horizontal buses and vertical buses, respectively. This architecture enables HMSA to have the features of diagonally indexed concurrent broadcast and the accessibility to either rows (row control mode) or columns (column control mode) of 2D array PEs alternately. An algorithmic mapping method is used for performance evaluation by mapping matrix multiplication on the proposed architecture. The asymptotic time complexities of them are evaluated and the result shows that paralle matrix multiplication on HMSA can provide significant performance improvement.

  • PDF

A study on game physics engine focused on real time physics (물리 엔진에 관한 고찰 : 실시간 물리 기술을 중심으로)

  • Ha, You-Jong;Park, Kyoung-Ju
    • Journal of Korea Game Society
    • /
    • v.9 no.5
    • /
    • pp.43-52
    • /
    • 2009
  • This paper analyzes the four game physics engines in terms of real time techniques. Real time physics is the technology that simplifies the physics-based simulation to apply for the real time applications such as game. Our study includes two commercial physics engines, Havok's Physics SDK and NVIDIA's PhysX SDK, and two open source projects, Open Dynamics Engine and Bullet physics engine. As a result, most of them covers rigid body dynamics and some include either deformable body simulation or fluids simulation, or both. For real time simulation, they adopt the simplified numerical methods, the effective in collision detection/response, and also use the parallel processing hardwares, i.e., multi core CPU, Physics processing unit(PPU), or graphics processing unit(GPU).

  • PDF

Acceleration of Anisotropic Elastic Reverse-time Migration with GPUs (GPU를 이용한 이방성 탄성 거꿀 참반사 보정의 계산가속)

  • Choi, Hyungwook;Seol, Soon Jee;Byun, Joongmoo
    • Geophysics and Geophysical Exploration
    • /
    • v.18 no.2
    • /
    • pp.74-84
    • /
    • 2015
  • To yield physically meaningful images through elastic reverse-time migration, the wavefield separation which extracts P- and S-waves from reconstructed vector wavefields by using elastic wave equation is prerequisite. For expanding the application of the elastic reverse-time migration to anisotropic media, not only the anisotropic modelling algorithm but also the anisotropic wavefield separation is essential. The anisotropic wavefield separation which uses pseudo-derivative filters determined according to vertical velocities and anisotropic parameters of elastic media differs from the Helmholtz decomposition which is conventionally used for the isotropic wavefield separation. Since applying these pseudo-derivative filter consumes high computational costs, we have developed the efficient anisotropic wavefield separation algorithm which has capability of parallel computing by using GPUs (Graphic Processing Units). In addition, the highly efficient anisotropic elastic reverse-time migration algorithm using MPI (Message-Passing Interface) and incorporating the developed anisotropic wavefield separation algorithm with GPUs has been developed. To verify the efficiency and the validity of the developed anisotropic elastic reverse-time migration algorithm, a VTI elastic model based on Marmousi-II was built. A synthetic multicomponent seismic data set was created using this VTI elastic model. The computational speed of migration was dramatically enhanced by using GPUs and MPI and the accuracy of image was also improved because of the adoption of the anisotropic wavefield separation.