Title/Summary/Keyword: Parallel GPU


Parallel Self-Collision Detection for Large 3D Mesh Model using GPU (GPU를 이용한 대용량 3D 메쉬 모델에 대한 병렬 자체 충돌검사)

  • Park, Sung-Hun;Kim, Yangen;Choi, Yoo-Joo
    • Proceedings of the Korea Information Processing Society Conference / 2022.05a / pp.708-711 / 2022
  • To increase the success rate of 3D printing output, this paper proposes a parallel self-collision detection method for large 3D mesh models using the GPU. For robust and reliable self-collision detection, we propose a procedure consisting of a separating-axis test, a triangle-triangle intersection test, a mesh connectivity test, and a partitioning technique for large meshes. To perform this self-collision detection quickly, we present a GPU-based parallel implementation.
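A minimal CUDA sketch of the thread-per-candidate-pair mapping such a pipeline implies is shown below. The struct names are illustrative, and an AABB overlap test stands in for the paper's separating-axis and exact triangle-triangle tests:

```cuda
#include <cuda_runtime.h>

struct Tri  { float3 v0, v1, v2; };
struct Pair { int a, b; };

__device__ float3 triMin(const Tri& t) {
    return make_float3(fminf(fminf(t.v0.x, t.v1.x), t.v2.x),
                       fminf(fminf(t.v0.y, t.v1.y), t.v2.y),
                       fminf(fminf(t.v0.z, t.v1.z), t.v2.z));
}
__device__ float3 triMax(const Tri& t) {
    return make_float3(fmaxf(fmaxf(t.v0.x, t.v1.x), t.v2.x),
                       fmaxf(fmaxf(t.v0.y, t.v1.y), t.v2.y),
                       fmaxf(fmaxf(t.v0.z, t.v1.z), t.v2.z));
}

// One thread per candidate triangle pair. Pairs sharing vertices would
// already be filtered out by the mesh connectivity test the abstract
// mentions; an AABB overlap is a cheap stand-in for the exact tests.
__global__ void selfCollision(const Tri* tris, const Pair* pairs,
                              int nPairs, int* flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPairs) return;
    Tri A = tris[pairs[i].a];
    Tri B = tris[pairs[i].b];
    float3 aMin = triMin(A), aMax = triMax(A);
    float3 bMin = triMin(B), bMax = triMax(B);
    bool overlap = aMin.x <= bMax.x && bMin.x <= aMax.x &&
                   aMin.y <= bMax.y && bMin.y <= aMax.y &&
                   aMin.z <= bMax.z && bMin.z <= aMax.z;
    flags[i] = overlap ? 1 : 0;   // survivors go on to the exact test
}
```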

Parallelization and Performance Optimization of the Boyer-Moore Algorithm on GPU (Boyer-Moore 알고리즘을 위한 GPU상에서의 병렬 최적화)

  • Jeong, Yosang;Tran, Nhat-Phuong;Lee, Myungho;Nam, Dukyun;Kim, Jik-Soo;Hwang, Soonwook
    • KIISE Transactions on Computing Practices / v.21 no.2 / pp.138-143 / 2015
  • The Boyer-Moore algorithm is a single-pattern string matching algorithm widely used in applications such as computer and internet security and bioinformatics. The algorithm is computationally demanding and calls for high-performance parallel processing. In this paper, we propose a parallelization and performance optimization methodology for the BM algorithm on a GPU. Our methodology adopts an algorithmic cascading technique, which significantly reduces the mapping overheads for the threads participating in the parallel string matching and efficiently utilizes the multithreading capability of the GPU, improving load balancing among threads. Our experimental results show that this approach achieves a speedup of up to 45 times compared with serial execution.
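The paper's exact cascading scheme is not reproduced here, but a hypothetical CUDA sketch of the underlying chunked mapping, one thread per text chunk with an (m-1)-byte overlap and the standard bad-character rule, might look like this:

```cuda
#include <cuda_runtime.h>

// Bad-character table: last index of each byte in the pattern, else -1.
void buildBadChar(const char* pat, int m, int badChar[256]) {
    for (int c = 0; c < 256; ++c) badChar[c] = -1;
    for (int j = 0; j < m; ++j) badChar[(unsigned char)pat[j]] = j;
}

// One thread scans one chunk plus an (m-1)-byte overlap so matches that
// straddle a chunk boundary are not missed. The paper's cascading would
// assign several such chunks to each thread.
__global__ void bmSearch(const char* text, int n, const char* pat, int m,
                         const int* badChar, int chunk, int* matchCount) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = t * chunk;
    if (begin >= n) return;
    int end = min(begin + chunk + m - 1, n);
    int s = begin;
    while (s <= end - m) {
        int j = m - 1;
        while (j >= 0 && pat[j] == text[s + j]) --j;
        if (j < 0) { atomicAdd(matchCount, 1); s += 1; }
        else       s += max(1, j - badChar[(unsigned char)text[s + j]]);
    }
}
```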

GLSL based Additional Learning Nearest Neighbor Algorithm suitable for Locating Unpaved Road (추가 학습이 빈번히 필요한 비포장도로에서 주행로 탐색에 적합한 GLSL 기반 ALNN Algorithm)

  • Ku, Bon Woo;Kim, Jun kyum;Rhee, Eun Joo
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.12 no.1 / pp.29-36 / 2019
  • In national defense applications, the driving routes of unmanned autonomous vehicles include not only paved roads but also unpaved roads with rough and unexpected changes. These vehicles monitor and reconnoiter rugged or remote areas and defend their own positions, so they frequently encounter varied and unpredictable road environments. They therefore need additional learning to drive in such environments, and we propose an Additional Learning Nearest Neighbor (ALNN) algorithm, modified from Approximate Nearest Neighbor, that allows quick learning while avoiding the 'forgetting' problem. In addition, since the execution speed of the ALNN algorithm decreases as the learning data accumulates, we also propose a solution to this problem using GPU parallel processing based on the OpenGL Shading Language (GLSL). The GPU-based ALNN algorithm can be used in national defense and similar fields that require frequent and rapid application of additional learning in real time without affecting the existing learning data.
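The paper implements its search in GLSL; purely as an illustration of the same data-parallel idea, a CUDA kernel that evaluates the distance from a query to every stored sample in parallel could look like this (all names are assumptions):

```cuda
#include <cuda_runtime.h>

// Squared distance from one query to every stored training sample, one
// thread per sample; the host (or a reduction kernel) then picks the
// nearest neighbors.
__global__ void nnDistances(const float* train, int n, int dim,
                            const float* query, float* dist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float d = 0.0f;
    for (int k = 0; k < dim; ++k) {
        float diff = train[i * dim + k] - query[k];
        d += diff * diff;
    }
    dist[i] = d;
}
// "Additional learning" then amounts to appending rows to `train`, which
// is why the exhaustive search slows down as the data accumulates.
```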

Efficient Workload Distribution of Photomosaic Using OpenCL into a Heterogeneous Computing Environment (이기종 컴퓨팅 환경에서 OpenCL을 사용한 포토모자이크 응용의 효율적인 작업부하 분배)

  • Kim, Heegon;Sa, Jaewon;Choi, Dongwhee;Kim, Haelyeon;Lee, Sungju;Chung, Yongwha;Park, Daihee
    • KIPS Transactions on Computer and Communication Systems / v.4 no.8 / pp.245-252 / 2015
  • Recently, parallel processing methods using accelerators have been introduced into high-performance and mobile computing. The photomosaic application can be parallelized by exploiting its inherent data parallelism with an accelerator. In this paper, we propose a way to distribute the workload of the photomosaic application across a heterogeneous CPU-GPU computing environment. That is, the photomosaic application is parallelized using both CPU and GPU resources with the asynchronous mode of OpenCL, and the optimal workload distribution rate is then estimated by measuring the execution times at CPU-only and GPU-only distribution rates. The proposed approach is simple but very effective, and it can be applied to parallelize other applications in a heterogeneous CPU-GPU computing environment. Based on the experimental results, we confirm that performance with the optimal workload distribution in the heterogeneous environment is improved by 141% compared with the GPU-only method.
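The paper uses OpenCL's asynchronous mode; the following CUDA-flavoured sketch (all names hypothetical) illustrates the same split: launch the GPU share of the tiles asynchronously, process the remainder on the CPU concurrently, then synchronize. The share r would come from the timing runs the abstract describes, e.g. r = tCPU / (tCPU + tGPU) so both sides finish at about the same time:

```cuda
#include <cuda_runtime.h>

// Stand-in GPU work: "match" one mosaic tile per thread (hypothetical).
__global__ void gpuTiles(float* out, int nGpu) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nGpu) out[i] = i * 2.0f;
}

// Stand-in CPU work for the remaining tiles.
void cpuTiles(float* out, int begin, int end) {
    for (int i = begin; i < end; ++i) out[i] = i * 2.0f;
}

// Give the first r*N tiles to the GPU asynchronously and the rest to the
// CPU, overlapping the two devices.
void process(float* devOut, float* hostOut, int N, float r, cudaStream_t s) {
    int nGpu = (int)(r * N);
    if (nGpu > 0)
        gpuTiles<<<(nGpu + 255) / 256, 256, 0, s>>>(devOut, nGpu); // async
    cpuTiles(hostOut, nGpu, N);       // CPU runs while the GPU works
    cudaStreamSynchronize(s);         // join before using the results
}
```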

Analysis of GPU Performance and Memory Efficiency according to Task Processing Units (작업 처리 단위 변화에 따른 GPU 성능과 메모리 접근 시간의 관계 분석)

  • Son, Dong Oh;Sim, Gyu Yeon;Kim, Cheol Hong
    • Smart Media Journal / v.4 no.4 / pp.56-63 / 2015
  • Modern GPUs can execute massively parallel computations by exploiting many GPU cores. The GPGPU architecture, one approach to exploiting the GPU's outstanding computational resources, effectively executes general-purpose applications as well as graphics applications. In this paper, we investigate the impact of the number of CTAs (Cooperative Thread Arrays) per SM (Streaming Multiprocessor) on performance and memory efficiency, since analyzing this relation offers guidance to researchers seeking to improve GPU performance. Our simulation results show that most benchmarks improve in performance as the number of CTAs per SM increases. Some benchmarks, however, show no improvement, either because the same kernel generates only a few CTAs or because not enough CTAs execute simultaneously. To classify the performance behavior according to the number of CTAs per SM more precisely, we also analyze the relation between performance and memory stalls, DRAM stalls due to interconnect congestion, and pipeline stalls at the memory stage. We expect our analysis to aid work on improving parallelism and memory efficiency in GPGPU architectures.
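The number of resident CTAs per SM can be queried for a given kernel and block size through CUDA's occupancy API; a small self-contained probe (with a placeholder kernel) is sketched below:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; only its resource usage matters for occupancy.
__global__ void probeKernel(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;
}

int main() {
    for (int blockSize = 64; blockSize <= 1024; blockSize *= 2) {
        int ctasPerSm = 0;
        // Reports how many CTAs of this kernel fit on one SM at once.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &ctasPerSm, probeKernel, blockSize, /*dynamicSmemBytes=*/0);
        printf("blockSize=%4d -> %d CTAs per SM\n", blockSize, ctasPerSm);
    }
    return 0;
}
```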

An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU (Coloring이 적용된 Gauss-Seidel 해법을 통한 CPU와 GPU의 연산 효율에 관한 연구)

  • Yoon, Jong Seon;Jeon, Byoung Jin;Choi, Hyoung Gwon
    • Transactions of the Korean Society of Mechanical Engineers B / v.41 no.2 / pp.117-124 / 2017
  • The performance of the colored Gauss-Seidel solver on CPU and GPU was investigated for two- and three-dimensional heat conduction problems using different mesh sizes. The heat conduction equation was discretized by the finite difference method and the finite element method. The CPU performed well for small problems but deteriorated for large problems, where the total memory required for computing exceeded the cache memory. In contrast, the GPU performed better as the mesh size increased because of its latency-hiding capability. GPU computation with the colored Gauss-Seidel solver was approximately 7 times faster than computation on a single CPU. Furthermore, when parallel computing was conducted on the GPU, the colored Gauss-Seidel solver was found to be approximately twice as fast as the Jacobi solver.
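A minimal sketch of one red-black (colored) Gauss-Seidel sweep for the 2D heat conduction stencil is given below; the grid layout and naming are assumptions, not the paper's code. Because every neighbor of a red cell is black, each color can be updated in parallel without read-write races:

```cuda
#include <cuda_runtime.h>

// One sweep updates only cells of one color; launch with color = 0, then 1,
// to complete a full Gauss-Seidel iteration.
__global__ void coloredGaussSeidel(float* T, int nx, int ny, int color) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;          // red-black coloring
    // Neighbors are all of the other color, so in-place update is race-free.
    T[j * nx + i] = 0.25f * (T[j * nx + i - 1] + T[j * nx + i + 1] +
                             T[(j - 1) * nx + i] + T[(j + 1) * nx + i]);
}
```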

Development of GPU-accelerated kinematic wave model using CUDA fortran (CUDA fortran을 이용한 GPU 가속 운동파모형 개발)

  • Kim, Boram;Park, Seonryang;Kim, Dae-Hong
    • Journal of Korea Water Resources Association / v.52 no.11 / pp.887-894 / 2019
  • We propose a GPU (Graphic Processing Unit) accelerated kinematic wave model for rainfall-runoff simulation and test the accuracy and speedup of the proposed model. The governing equations are the kinematic wave equation for surface flow and the Green-Ampt model for infiltration. The kinematic wave equations were discretized using a finite volume method, and CUDA Fortran was used to implement the rainfall-runoff model. Several numerical tests were conducted. The computed results of the GPU-accelerated kinematic wave model were compared with measured data and other numerical results, and reasonable agreement was observed. The speedup of the GPU-accelerated model increased with the number of grid cells, reaching approximately 450 times over a CPU (Central Processing Unit) version, at least for the tested computing resources.
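The paper's model is written in CUDA Fortran; as a hedged illustration in CUDA C, one explicit finite-volume update of the flow depth per cell, with a Manning-type rating q = αh^m and constant rain/infiltration source terms standing in for the Green-Ampt model, might look like:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One explicit finite-volume step of a 1D kinematic wave:
// dh/dt = -(dq/dx) + (rain - infiltration), with q = alpha * h^m.
// All names and the constant source terms are illustrative assumptions.
__global__ void kinematicWaveStep(const float* h, float* hNew, int n,
                                  float dt, float dx, float alpha, float m,
                                  float rain, float infil) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n) return;
    float qIn  = alpha * powf(h[i - 1], m);  // flux through upstream face
    float qOut = alpha * powf(h[i], m);      // flux through downstream face
    hNew[i] = fmaxf(0.0f, h[i] - dt / dx * (qOut - qIn) + dt * (rain - infil));
}
```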

A Study on GPU-based Iterative ML-EM Reconstruction Algorithm for Emission Computed Tomographic Imaging Systems (방출단층촬영 시스템을 위한 GPU 기반 반복적 기댓값 최대화 재구성 알고리즘 연구)

  • Ha, Woo-Seok;Kim, Soo-Mee;Park, Min-Jae;Lee, Dong-Soo;Lee, Jae-Sung
    • Nuclear Medicine and Molecular Imaging / v.43 no.5 / pp.459-467 / 2009
  • Purpose: Maximum likelihood-expectation maximization (ML-EM) is a statistical reconstruction algorithm derived from a probabilistic model of the emission and detection processes. Although ML-EM has many advantages in accuracy and utility, its use is limited by the computational burden of iterative processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on the GPU (graphic processing unit) for the ML-EM algorithm. Materials and Methods: Using a Geforce 9800 GTX+ graphics card and NVIDIA's CUDA (compute unified device architecture) technology, the projection and backprojection steps of the ML-EM algorithm were parallelized. We measured the time spent per iteration on projection, on computing the errors between measured and estimated data, and on backprojection. Total time included the latency of data transmission between RAM and GPU memory. Results: The total computation times of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 sec, respectively, so the computing speed was improved about 15 times on the GPU. When the number of iterations increased to 1024, the CPU- and GPU-based computations took 18 min and 8 sec in total, respectively. The improvement was about 135 times and was caused by a growing delay in CPU-based computing after a certain number of iterations. In contrast, the GPU-based computation showed very little variation in time per iteration, owing to the use of shared memory. Conclusion: GPU-based parallel computation significantly improved the computing speed and stability of ML-EM. The developed GPU-based ML-EM algorithm could easily be modified for other imaging geometries.
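The voxel update at the heart of each ML-EM iteration is embarrassingly parallel once the projection and backprojection are done, which is what makes the GPU mapping effective. A hedged sketch of that update kernel (names are illustrative, not the paper's code):

```cuda
#include <cuda_runtime.h>

// ML-EM multiplicative update: image_j *= backprojected(measured/expected)_j
// normalized by the sensitivity s_j. One independent thread per voxel.
__global__ void mlemUpdate(float* image, const float* backprojRatio,
                           const float* sensitivity, int nVoxels) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= nVoxels) return;
    if (sensitivity[j] > 0.0f)
        image[j] *= backprojRatio[j] / sensitivity[j];
}
```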

Parallel Intersection Detection Algorithm using CUDA (CUDA 를 이용한 가상 객체들간의 병렬 충돌 검사 알고리즘)

  • Lee, Yeon-Hee;Kim, Young-J.
    • Proceedings of the HCI Society of Korea Conference / 2008.02a / pp.451-455 / 2008
  • In this paper, we present how we implement the low-level triangle intersection routine in a massively parallel fashion using NVIDIA's new GPGPU language, CUDA. Triangle intersection often becomes a computational bottleneck in the collision detection problem. Due to the relatively low bandwidth between CPU and GPU, it has been challenging to implement efficient object-space collision detection between triangle sets. However, thanks to the improved data transmission rates of the CUDA architecture, we improved the performance of triangle intersection substantially over the optimized CPU counterpart.
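As an illustration of the thread-per-pair mapping described, the sketch below assigns one thread to each unordered triangle pair on a 2D grid and applies the first stage of a typical triangle-triangle test, rejecting a pair when one triangle lies entirely on one side of the other's plane; the exact interval test would follow. Names are assumptions, not the paper's code:

```cuda
#include <cuda_runtime.h>

struct Tri { float3 a, b, c; };

__device__ float3 sub3(float3 u, float3 v) {
    return make_float3(u.x - v.x, u.y - v.y, u.z - v.z);
}
__device__ float3 cross3(float3 u, float3 v) {
    return make_float3(u.y*v.z - u.z*v.y, u.z*v.x - u.x*v.z, u.x*v.y - u.y*v.x);
}
__device__ float dot3(float3 u, float3 v) {
    return u.x*v.x + u.y*v.y + u.z*v.z;
}

__global__ void pairPlaneTest(const Tri* tris, int n, int* maybeHit) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n || i >= j) return;   // one thread per unordered pair
    float3 nA = cross3(sub3(tris[i].b, tris[i].a), sub3(tris[i].c, tris[i].a));
    float dA = -dot3(nA, tris[i].a);
    float s0 = dot3(nA, tris[j].a) + dA;      // signed distances of B's
    float s1 = dot3(nA, tris[j].b) + dA;      // vertices to A's plane
    float s2 = dot3(nA, tris[j].c) + dA;
    if ((s0 > 0 && s1 > 0 && s2 > 0) || (s0 < 0 && s1 < 0 && s2 < 0))
        return;                               // B entirely on one side: reject
    maybeHit[i * n + j] = 1;                  // survivors get the exact test
}
```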


Implementation of LTE uplink System for SDR Platform using CUDA and UHD (CUDA와 UHD를 이용한 SDR 플랫폼 용 LTE 상향링크 시스템 구현)

  • Ahn, Chi Young;Kim, Yong;Choi, Seung Won
    • Journal of Korea Society of Digital Industry and Information Management / v.9 no.2 / pp.81-87 / 2013
  • In this paper, we present an implementation of a Long Term Evolution (LTE) Uplink (UL) system on a Software Defined Radio (SDR) platform based on a conventional Personal Computer (PC), which adopts a Graphic Processing Unit (GPU) for the SDR software modem and the Universal Software Radio Peripheral 2 (USRP2) with the USRP Hardware Driver (UHD) as the Radio Frequency (RF) transceiver. We adopted UHD because it provides flexibility in the design of the transceiver chain. A Cognitive Radio (CR) engine has also been implemented using UHD libraries. Meanwhile, the software modem in our system runs on the GPU, which is well suited to parallel computing due to its many Arithmetic Logic Units (ALUs). From our experimental tests, we measured the total processing time for a single frame of LTE UL data and found that it takes about 5.00 ms to transmit and 6.78 ms to receive. This means that the implemented system is capable of real-time processing of all the baseband signal processing algorithms required for an LTE UL system.
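The paper implements the full LTE UL modem; as a small, hedged illustration of the kind of batched GPU work such a modem spends much of its time in, the following uses NVIDIA's cuFFT library to run one subframe of 2048-point FFTs in a single batched call (sizes assumed from the LTE 20 MHz numerology):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int fftSize  = 2048;  // LTE 20 MHz FFT size (assumed)
    const int nSymbols = 14;    // SC-FDMA symbols per subframe, normal CP
    cufftComplex* d_sig;
    cudaMalloc(&d_sig, sizeof(cufftComplex) * fftSize * nSymbols);
    cudaMemset(d_sig, 0, sizeof(cufftComplex) * fftSize * nSymbols);

    cufftHandle plan;
    cufftPlan1d(&plan, fftSize, CUFFT_C2C, nSymbols);  // one batched plan
    cufftExecC2C(plan, d_sig, d_sig, CUFFT_FORWARD);   // all symbols at once
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_sig);
    return 0;
}
```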