• Title/Summary/Keyword: GPU optimization

Accelerating Molecular Dynamics Simulation Using Graphics Processing Unit

  • Myung, Hun-Joo; Sakamaki, Ryuji; Oh, Kwang-Jin; Narumi, Tetsu; Yasuoka, Kenji; Lee, Sik
    • Bulletin of the Korean Chemical Society, v.31 no.12, pp.3639-3643, 2010
  • We have developed a CUDA-enabled version of a general-purpose molecular dynamics simulation code for GPUs. Implementation details, including the parallelization scheme and performance optimizations, are described. We focus on the non-bonded force calculation because it is the most time-consuming part of a molecular dynamics simulation. Timing results for the CUDA-enabled and CPU versions were obtained and compared for a biomolecular system containing 23,558 atoms. The CUDA-enabled version was found to be faster than the CPU version, suggesting that GPUs can be useful hardware for molecular dynamics simulation.
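
As a rough illustration of the bottleneck this abstract describes, here is a minimal CUDA sketch of a brute-force Lennard-Jones non-bonded force kernel, one thread per atom. It is our own toy example, not the authors' code, and it omits the cutoffs and neighbor lists a production MD engine would use.

```cuda
// Toy sketch: brute-force O(N^2) Lennard-Jones forces, one thread per atom.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lj_forces(const float4* pos, float3* force, int n,
                          float eps, float sigma2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 f = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {                  // all pairs; no cell list
        if (j == i) continue;
        float dx = pi.x - pos[j].x, dy = pi.y - pos[j].y, dz = pi.z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float sr2 = sigma2 / r2, sr6 = sr2 * sr2 * sr2;
        float fr = 24.f * eps * sr6 * (2.f * sr6 - 1.f) / r2;  // force / r
        f.x += fr * dx; f.y += fr * dy; f.z += fr * dz;
    }
    force[i] = f;
}

int main() {
    const int n = 1024;
    float4* pos; float3* force;
    cudaMallocManaged(&pos, n * sizeof(float4));
    cudaMallocManaged(&force, n * sizeof(float3));
    for (int i = 0; i < n; ++i)                    // atoms on a coarse grid
        pos[i] = make_float4(i % 16, (i / 16) % 16, i / 256, 0.f);
    lj_forces<<<(n + 255) / 256, 256>>>(pos, force, n, 1.f, 1.f);
    cudaDeviceSynchronize();
    std::printf("f[0] = (%g, %g, %g)\n", force[0].x, force[0].y, force[0].z);
    cudaFree(pos); cudaFree(force);
}
```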

KAWS: Coordinate Kernel-Aware Warp Scheduling and Warp Sharing Mechanism for Advanced GPUs

  • Vo, Viet Tan; Kim, Cheol Hong
    • Journal of Information Processing Systems, v.17 no.6, pp.1157-1169, 2021
  • Modern graphics processing unit (GPU) architectures offer significantly enhanced hardware resources for parallel computing. Without software optimization, however, GPUs continually underutilize those resources. In this paper, we show the need to switch between warp-scheduling policies during different kernel execution periods to improve resource utilization. Existing warp schedulers are unaware of kernel progress and therefore cannot provide an effective scheduling policy throughout execution. In addition, we identified the potential to improve resource utilization on GPUs with multiple warp schedulers by sharing stalled warps among selected schedulers. To address these efficiency issues, we coordinate a kernel-aware warp scheduler with a warp-sharing mechanism (KAWS). The proposed scheduler tracks the execution progress of the running kernel and switches to a more effective scheduling policy when the kernel reaches a point of resource underutilization. Meanwhile, the warp-sharing mechanism hands stalled warps to warp schedulers whose execution pipeline units are ready. Our design outperforms the traditional warp scheduler by 7.97% on average while incurring marginal additional hardware overhead.
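
KAWS is a hardware scheduler design, so it cannot be reproduced in application code; the toy host-side model below (entirely ours) only illustrates the kernel-aware idea of switching the issue policy, from Greedy-Then-Oldest to fair round-robin, once kernel progress passes a threshold. The warp count, threshold, and policies are illustrative assumptions.

```cuda
// Toy model of kernel-aware policy switching, not the paper's hardware.
#include <cstdio>

const int WARPS = 8, TOTAL = 32;          // instructions each warp must issue

int main() {
    int pc[WARPS] = {0}, issued = 0, rr = 0, greedy = 0;
    while (issued < WARPS * TOTAL) {
        double progress = (double)issued / (WARPS * TOTAL);
        int w = -1;
        if (progress < 0.75) {            // early phase: Greedy-Then-Oldest
            if (pc[greedy] >= TOTAL)      // greedy warp finished: take oldest
                for (int k = 0; k < WARPS; ++k)
                    if (pc[k] < TOTAL) { greedy = k; break; }
            w = greedy;
        } else {                          // kernel tail: rotate fairly
            for (int k = 0; k < WARPS; ++k) {
                int c = (rr + k) % WARPS;
                if (pc[c] < TOTAL) { w = c; rr = c + 1; break; }
            }
        }
        ++pc[w]; ++issued;
    }
    std::printf("all %d warps issued %d instructions\n", WARPS, TOTAL);
}
```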

A Study on GPU Support and a Function Structure Suitable for AI Model Inference on Serverless Platforms

  • Hwang, Dong-Hyun; Kim, Dongmin; Choi, Young-Yoon; Han, Seung-Ho; Jeon, Gi-Man; Son, Jae-Gi
    • Annual Conference of KIPS, 2019.10a, pp.19-20, 2019
  • A serverless framework realizes the ideas of microservice architecture on top of clouds and containers, and its use is growing as public cloud platforms such as Amazon's AWS (Amazon Web Services) offer it as a service. However, existing platforms provide insufficient support for serving AI models with hardware dependencies such as GPUs. In this paper, we implement a GPU-capable serverless platform by applying NVIDIA Docker and the k8s-device-plugin to a container-based open-source serverless platform. We also propose a function structure that reduces the repeated loading of weights when an AI model runs in a container. We evaluated the implemented platform in comparative experiments using the SSD (Single Shot Multibox Detector) object detection model and confirmed that the function response time of the serverless platform serving the AI model was improved.
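
The weight-caching structure the paper proposes is not spelled out in the abstract; the sketch below is only a guess at the general pattern, in host-side C++: keep the model weights in process-global state so repeated invocations against a warm container skip the reload. The handler name, weight path, and Model type are all hypothetical.

```cuda
// Sketch: weights loaded once per warm container, reused across invocations.
#include <cstdio>
#include <mutex>
#include <string>
#include <vector>

struct Model { std::vector<float> weights; };

static Model g_model;          // survives across invocations in a warm container
static std::once_flag g_loaded;

static void load_weights(const std::string& path) {
    // Stand-in for reading a checkpoint from disk (path is hypothetical).
    g_model.weights.assign(1 << 20, 0.5f);
    std::printf("weights loaded from %s\n", path.c_str());
}

// Handler called once per serverless invocation.
std::string handle(const std::string& request) {
    std::call_once(g_loaded, load_weights, "/models/ssd.bin");
    // ... run inference with g_model here ...
    return "detections for " + request;
}

int main() {
    // Two invocations against the same warm container: weights load once.
    std::printf("%s\n", handle("image1.jpg").c_str());
    std::printf("%s\n", handle("image2.jpg").c_str());
}
```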

Algorithmic GPGPU Memory Optimization

  • Jang, Byunghyun; Choi, Minsu; Kim, Kyung Ki
    • JSTS: Journal of Semiconductor Technology and Science, v.14 no.4, pp.391-406, 2014
  • The performance of General-Purpose computation on Graphics Processing Units (GPGPU) depends heavily on memory access behavior. This sensitivity stems from the combination of the underlying Massively Parallel Processing (MPP) execution model of GPUs and their lack of architectural support for irregular memory access patterns. Application performance can be improved significantly by applying memory-access-pattern-aware optimizations that exploit the characteristics of each access pattern. In this paper, we present an algorithmic methodology that semi-automatically finds the best mapping of the memory accesses in serial loop nests onto the underlying data-parallel architecture, based on a comprehensive static memory access pattern analysis. To that end, we present a simple yet powerful mathematical model that captures all of the memory access pattern information present in serial data-parallel loop nests. We then show how this model is used in practice to select the most appropriate memory space for data and to search a large design space for an appropriate thread mapping and work-group size. To evaluate the effectiveness of our methodology, we report execution speedups on selected benchmark kernels covering a wide range of memory access patterns commonly found in GPGPU workloads. Our experimental results were obtained with the industry-standard heterogeneous programming language OpenCL, targeting the NVIDIA GT200 architecture.
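
The paper's methodology is a static analysis, but the transformations it selects are of the kind shown below: a tiled matrix transpose, in CUDA rather than the paper's OpenCL, that stages data in shared memory so that both the loads and the stores are coalesced. This is a standard textbook kernel, not code from the paper.

```cuda
// Access-pattern rewrite: a naive transpose makes uncoalesced stores;
// staging a tile in shared memory makes both loads and stores coalesced.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 32

__global__ void transpose_tiled(const float* in, float* out, int w, int h) {
    __shared__ float tile[TILE][TILE + 1];   // +1 pad avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];    // coalesced load
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;     // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];   // coalesced store
}

int main() {
    const int w = 256, h = 128;
    float *in, *out;
    cudaMallocManaged(&in, w * h * sizeof(float));
    cudaMallocManaged(&out, w * h * sizeof(float));
    for (int i = 0; i < w * h; ++i) in[i] = (float)i;
    dim3 grid(w / TILE, h / TILE), block(TILE, TILE);
    transpose_tiled<<<grid, block>>>(in, out, w, h);
    cudaDeviceSynchronize();
    std::printf("out[1] = %g (expect %d)\n", out[1], w);   // out[0][1] = in[1][0]
    cudaFree(in); cudaFree(out);
}
```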

High Throughput Parallel KMP Algorithm Considering CPU-GPU Memory Hierarchy

  • Park, Soeun; Kim, Daehee; Lee, Myungho; Park, Neungsoo
    • The Transactions of The Korean Institute of Electrical Engineers, v.67 no.5, pp.656-662, 2018
  • Pattern matching algorithms are widely used in many application fields, such as bioinformatics and intrusion detection. Among the many string matching algorithms, KMP (Knuth-Morris-Pratt) is commonly used because of its fast execution on large texts. However, even the KMP algorithm becomes a bottleneck as the text size grows significantly. In this paper, we propose a high-throughput parallel KMP algorithm that accounts for the CPU-GPU memory hierarchy, implemented in OpenCL for GPGPU (General-Purpose computing on Graphics Processing Units). We focus on optimizing the allocation of work-items and work-groups, copying the pattern data and the failure table into local memory, and overlapping data transfers with the string matching operations. Experimental results show that the optimized parallel KMP algorithm runs about 3.6 times faster than the non-optimized parallel version.
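
A minimal sketch of the core idea, in CUDA although the paper itself uses OpenCL: each thread runs KMP over one text chunk, chunks overlap by |pattern|-1 characters so cross-boundary matches are not lost, and the failure table is staged in shared memory, mirroring the paper's local-memory optimization. The chunk size and launch shape are illustrative.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void kmp_count(const char* text, int n, const char* pat, int m,
                          const int* fail, int chunk, int* count) {
    extern __shared__ int sfail[];            // failure table in shared memory
    for (int i = threadIdx.x; i < m; i += blockDim.x) sfail[i] = fail[i];
    __syncthreads();
    int beg = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    if (beg >= n) return;
    int end = min(beg + chunk + m - 1, n);    // overlap into the next chunk
    int k = 0;
    for (int i = beg; i < end; ++i) {
        while (k > 0 && text[i] != pat[k]) k = sfail[k - 1];
        if (text[i] == pat[k]) ++k;
        if (k == m) {                         // match ends at position i
            if (i - m + 1 < beg + chunk)      // count only if it starts in my chunk
                atomicAdd(count, 1);
            k = sfail[k - 1];
        }
    }
}

int main() {
    const char* t = "abababcababcabababc";
    const char* p = "ababc";
    int n = (int)strlen(t), m = (int)strlen(p), chunk = 4;
    int fail[8] = {0};                        // classic KMP failure table
    for (int i = 1, k = 0; i < m; ++i) {
        while (k > 0 && p[i] != p[k]) k = fail[k - 1];
        if (p[i] == p[k]) ++k;
        fail[i] = k;
    }
    char *dt, *dp; int *df, *dc;
    cudaMalloc(&dt, n); cudaMalloc(&dp, m);
    cudaMalloc(&df, m * sizeof(int)); cudaMalloc(&dc, sizeof(int));
    cudaMemcpy(dt, t, n, cudaMemcpyHostToDevice);
    cudaMemcpy(dp, p, m, cudaMemcpyHostToDevice);
    cudaMemcpy(df, fail, m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dc, 0, sizeof(int));
    int threads = 32, blocks = (n / chunk + threads) / threads;
    kmp_count<<<blocks, threads, m * sizeof(int)>>>(dt, n, dp, m, df, chunk, dc);
    int c; cudaMemcpy(&c, dc, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("matches: %d\n", c);          // "ababc" occurs 3 times
}
```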

An Optimized Iterative Semantic Compression Algorithm And Parallel Processing for Large Scale Data

  • Jin, Ran; Chen, Gang; Tung, Anthony K.H.; Shou, Lidan; Ooi, Beng Chin
    • KSII Transactions on Internet and Information Systems (TIIS), v.12 no.6, pp.2761-2781, 2018
  • With the continuous growth of data volumes, data reduction by compression has great research value and practical significance. To address shortcomings of existing semantic compression algorithms, this paper analyzes the ItCompress algorithm and designs a bidirectional order-selection method based on interval partitioning, named the Optimized Iterative Semantic Compression algorithm (Optimized ItCompress). To further improve speed, we propose a parallel optimized iterative semantic compression algorithm using GPUs (POICAG) and an optimized iterative semantic compression algorithm using Spark (DOICAS). Extensive experiments on four kinds of datasets validate the efficiency of the proposed algorithms.
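
The abstract does not give the algorithm's details; as background, here is a simplified host-side sketch (our reading, not the paper's code) of the ItCompress-style encoding step it builds on: each row is stored as the id of its best-matching representative row plus the columns ("outliers") where the two disagree beyond a tolerance. The tolerance and data are made up.

```cuda
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

struct Encoded { int rep; std::vector<std::pair<int, float>> outliers; };

// Pick the representative that leaves the fewest outlier columns.
Encoded encode_row(const std::vector<float>& row,
                   const std::vector<std::vector<float>>& reps, float tol) {
    Encoded best{-1, {}};
    size_t best_out = row.size() + 1;
    for (int r = 0; r < (int)reps.size(); ++r) {
        Encoded e{r, {}};
        for (int c = 0; c < (int)row.size(); ++c)
            if (std::fabs(row[c] - reps[r][c]) > tol)
                e.outliers.push_back({c, row[c]});     // column kept verbatim
        if (e.outliers.size() < best_out) { best_out = e.outliers.size(); best = e; }
    }
    return best;
}

int main() {
    std::vector<std::vector<float>> reps = {{0, 0, 0}, {10, 10, 10}};
    std::vector<float> row = {10.2f, 9.9f, 0.0f};
    Encoded e = encode_row(row, reps, 0.5f);
    std::printf("rep=%d outliers=%zu\n", e.rep, e.outliers.size());
}
```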

The Optimization Mechanism of CPU/GPU Computing Resource for Minimization of Performance Interference and Calculation Efficiency in Volunteer Computing Environment

  • Bak, Bong Woo; Song, Chung Geon; Yu, Heon Chang
    • KIPS Transactions on Computer and Communication Systems, v.6 no.12, pp.479-486, 2017
  • Volunteer computing is a computing paradigm that runs computations on the idle resources of many nodes. How the client application behaves during volunteer computation is determined by user-supplied settings, and ideal operation requires settings optimized for the system's characteristics and for how other applications behave. In this paper, we periodically analyze CPU and GPU utilization and develop a manager that dynamically applies optimized options. With the proposed mechanism, task throughput is higher than in existing volunteer computing while performance interference is minimized, so volunteers can contribute more computing resources to volunteer computing projects.
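
A toy sketch (ours, not the authors' manager) of the control loop the abstract implies: periodically sample CPU/GPU utilization and grant volunteer tasks whatever capacity the foreground workload leaves idle. read_cpu_util and read_gpu_util are hypothetical stand-ins for platform-specific probes such as /proc/stat or NVML.

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <thread>

double read_cpu_util();   // fraction 0..1 used by *other* processes (stub below)
double read_gpu_util();

int main() {
    for (int tick = 0; tick < 10; ++tick) {
        double cpu_busy = read_cpu_util(), gpu_busy = read_gpu_util();
        // Give volunteer tasks whatever the foreground load leaves idle,
        // with a floor so progress never stops completely.
        double cpu_share = std::max(0.1, 1.0 - cpu_busy);
        double gpu_share = std::max(0.1, 1.0 - gpu_busy);
        std::printf("tick %d: cpu_share=%.2f gpu_share=%.2f\n",
                    tick, cpu_share, gpu_share);
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
}

// Dummy probes so the sketch runs anywhere.
double read_cpu_util() { return 0.3; }
double read_gpu_util() { return 0.6; }
```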

PDF Version 1.4-1.6 Password Cracking in CUDA GPU Environment

  • Kim, Hyun Jun; Eum, Si Woo; Seo, Hwa Jeong
    • KIPS Transactions on Computer and Communication Systems, v.12 no.2, pp.69-76, 2023
  • Hundreds of thousands of passwords are lost or forgotten every year, making information unavailable to its legitimate owners or to authorized law enforcement. Recovering such a password requires a password-cracking tool, and using GPUs instead of CPUs allows the large amount of computation in the recovery process to be handled quickly. This paper presents a CUDA-optimized GPU implementation focused on cracking PDF versions 1.4-1.6, currently the most widely used. Techniques include eliminating unnecessary operations in the MD5 algorithm, a 32-bit word implementation of the RC4 algorithm, and the use of shared memory. In addition, autotuning is used to search for the block and thread counts that maximize performance. As a result, we achieve throughputs of 31,460 kp/s (kilo-passwords per second) and 66,351 kp/s with a block size of 65,536 and a thread size of 96 on RTX 3060 and RTX 3090 systems, improving throughput by 22.5% and 15.2%, respectively, over hashcat, the cracking tool with the highest previously reported throughput.
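
Of the listed techniques, the autotuning step is the easiest to show generically: time a kernel over a grid of block/thread configurations with CUDA events and keep the fastest. crack_kernel below is a dummy stand-in for the MD5/RC4 password kernel, not the authors' implementation; the candidate sizes echo the configurations the abstract mentions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void crack_kernel(unsigned* out, int work) {
    unsigned h = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < work; ++i) h = h * 1664525u + 1013904223u;  // dummy load
    if (h == 0xdeadbeef) *out = h;    // keep the work observable
}

int main() {
    unsigned* out; cudaMalloc(&out, sizeof(unsigned));
    int blocks[] = {16384, 32768, 65536}, threads[] = {64, 96, 128, 256};
    float best_ms = 1e30f; int bb = 0, bt = 0;
    for (int b : blocks)
        for (int t : threads) {
            cudaEvent_t s, e; cudaEventCreate(&s); cudaEventCreate(&e);
            cudaEventRecord(s);
            crack_kernel<<<b, t>>>(out, 1000);
            cudaEventRecord(e); cudaEventSynchronize(e);
            float ms; cudaEventElapsedTime(&ms, s, e);
            if (ms < best_ms) { best_ms = ms; bb = b; bt = t; }
            cudaEventDestroy(s); cudaEventDestroy(e);
        }
    std::printf("best config: %d blocks x %d threads (%.3f ms)\n", bb, bt, best_ms);
    cudaFree(out);
}
```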

Dorsal Hand Vein Identification Based on Binary Particle Swarm Optimization

  • Benziane, Sarah Hachemi; Benyettou, Abdelkader
    • Journal of Information Processing Systems, v.13 no.2, pp.268-284, 2017
  • The dorsal hand vein biometric system we developed aims to produce an electronic signature using a secure signature device. In this paper, we present the stages of this device: extraction of the dorsal veins from images acquired with an infrared device; representation of the veins as shape descriptors that are invariant to translation, rotation, and scaling, the resulting descriptor vector being the input to the matching step; and optimization of the decision system's settings, namely the accept/reject threshold and the selection of the most relevant descriptors, to minimize both FAR and FRR. The final identification decision, based on descriptors selected by hybrid binary PSO, yields FAR = 0% and FRR = 0%.
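
A compact host-side sketch of binary PSO as typically used for descriptor selection: each particle is a bitmask over descriptors, velocities pass through a sigmoid to become per-bit probabilities, and personal/global bests guide the swarm. The fitness function here is a dummy stand-in for the paper's FAR/FRR-based matching score; dimensions and coefficients are illustrative.

```cuda
#include <cmath>
#include <cstdio>
#include <random>

const int DIMS = 16, PARTICLES = 20, ITERS = 100;

// Dummy objective: reward masks close to a made-up "good" descriptor subset.
double fitness(unsigned mask) {
    return -(double)__builtin_popcount(mask ^ 0xACB5u);
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> U(0, 1);
    unsigned x[PARTICLES], pbest[PARTICLES], gbest = 0;
    double v[PARTICLES][DIMS] = {}, pscore[PARTICLES], gscore = -1e9;
    for (int p = 0; p < PARTICLES; ++p) {
        x[p] = rng() & ((1u << DIMS) - 1); pbest[p] = x[p];
        pscore[p] = fitness(x[p]);
        if (pscore[p] > gscore) { gscore = pscore[p]; gbest = x[p]; }
    }
    for (int it = 0; it < ITERS; ++it)
        for (int p = 0; p < PARTICLES; ++p) {
            for (int d = 0; d < DIMS; ++d) {
                int xd = (x[p] >> d) & 1;
                v[p][d] += 2.0 * U(rng) * (((pbest[p] >> d) & 1) - xd)
                         + 2.0 * U(rng) * (((gbest   >> d) & 1) - xd);
                v[p][d] = std::max(-4.0, std::min(4.0, v[p][d]));  // clamp
                double sig = 1.0 / (1.0 + std::exp(-v[p][d]));     // bit probability
                if (U(rng) < sig) x[p] |= 1u << d; else x[p] &= ~(1u << d);
            }
            double s = fitness(x[p]);
            if (s > pscore[p]) { pscore[p] = s; pbest[p] = x[p]; }
            if (s > gscore)    { gscore = s;    gbest  = x[p]; }
        }
    std::printf("best mask = 0x%04x score = %.0f\n", gbest, gscore);
}
```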

GPU-Accelerated Single Image Depth Estimation with Color-Filtered Aperture

  • Hsu, Yueh-Teng; Chen, Chun-Chieh; Tseng, Shu-Ming
    • KSII Transactions on Internet and Information Systems (TIIS), v.8 no.3, pp.1058-1070, 2014
  • There are two major approaches to depth estimation: multiple-image and single-image. The former has a high hardware cost because it uses multiple cameras, but its software algorithm is simple; conversely, the latter has a low hardware cost but a complex algorithm. A recent trend in this field is to make systems compact, or even portable, and to simplify the optical elements attached to a conventional camera. In this paper, we present a GPU implementation of single-image depth estimation on a desktop PC and achieve real-time operation through our evolutionary algorithm and parallel processing techniques, implemented with a compute shader. These methods accelerate the compute-intensive single-image depth estimation from 0.003 frames per second (fps) (implemented in MATLAB) to 53 fps, almost twice the 30 fps real-time standard. To the best of our knowledge, no previous work addresses the optimization of single-image depth estimation, and our final frame rate exceeds that of previous multiple-image studies, which run at about 20 fps.
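
The paper's compute-shader implementation is not public; the CUDA kernel below sketches the per-pixel core of color-filtered-aperture depth estimation under our own simplifying assumptions: a color filter on the aperture displaces the red and blue channels in proportion to depth, so each thread searches for the horizontal shift that best re-aligns a small window of the two channels. Window size, shift range, and the synthetic test image are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void cfa_disparity(const float* red, const float* blue,
                              float* disp, int w, int h, int max_shift) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float best_cost = 1e30f; int best_s = 0;
    for (int s = -max_shift; s <= max_shift; ++s) {   // candidate shifts
        float cost = 0.f;
        for (int dx = -3; dx <= 3; ++dx) {            // 7-pixel window
            int xr = min(max(x + dx, 0), w - 1);
            int xb = min(max(x + dx + s, 0), w - 1);
            float d = red[y * w + xr] - blue[y * w + xb];
            cost += d * d;                            // SSD between channels
        }
        if (cost < best_cost) { best_cost = cost; best_s = s; }
    }
    disp[y * w + x] = (float)best_s;                  // depth is monotone in shift
}

int main() {
    const int w = 64, h = 64, max_shift = 4;
    float *red, *blue, *disp;
    cudaMallocManaged(&red, w * h * sizeof(float));
    cudaMallocManaged(&blue, w * h * sizeof(float));
    cudaMallocManaged(&disp, w * h * sizeof(float));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            red[y * w + x]  = (float)((x / 8) % 2);         // vertical stripes
            blue[y * w + x] = (float)(((x + 14) / 8) % 2);  // same stripes, shifted by 2
        }
    dim3 block(16, 16), grid(w / 16, h / 16);
    cfa_disparity<<<grid, block>>>(red, blue, disp, w, h, max_shift);
    cudaDeviceSynchronize();
    std::printf("disp at center = %g (expect 2)\n", disp[(h / 2) * w + w / 2]);
}
```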