• Title/Summary/Keyword: GPU Memory


Three-dimensional Wave Propagation Modeling using OpenACC and GPU (OpenACC와 GPU를 이용한 3차원 파동 전파 모델링)

  • Kim, Ahreum;Lee, Jongwoo;Ha, Wansoo
    • Geophysics and Geophysical Exploration / v.20 no.2 / pp.72-77 / 2017
  • We calculated 3D frequency- and Laplace-domain wavefields using time-domain modeling followed by a Fourier or Laplace transform. We adopted OpenACC and GPUs for efficient parallel computation. OpenACC makes it easy to use GPU accelerators by adding directives to conventional C, C++, and Fortran programs, so one does not have to learn a new GPGPU programming language such as CUDA or OpenCL to use a GPU. An OpenACC program allocates GPU memory, transfers data between the host CPU and GPU devices, and performs GPU operations automatically or following user-defined directives. We compared the performance of 3D wave propagation modeling programs using OpenACC and GPUs against a single-core CPU implementation through numerical tests. Results on a homogeneous model and the SEG/EAGE salt model show that the OpenACC programs are approximately 53 and 30 times faster, respectively, than the single-core CPU versions.
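As a minimal, hypothetical sketch of the directive style this abstract describes: one explicit time step of a 3D acoustic stencil offloaded with OpenACC. The kernel, array names, and data clauses are our assumptions, not the authors' code; it would be compiled with an OpenACC compiler such as NVIDIA's nvc with the -acc flag.

```c
/* Hypothetical OpenACC sketch (not the authors' code): one explicit
 * time step of a 3D acoustic wave equation. The directives alone ask
 * the compiler to move data and run the loop nest on the GPU. */
void time_step(int nx, int ny, int nz,
               const float *v2,   /* velocity^2 * dt^2, size nx*ny*nz */
               const float *prev, const float *curr, float *next)
{
    #pragma acc parallel loop collapse(3) \
        copyin(v2[0:nx*ny*nz], prev[0:nx*ny*nz], curr[0:nx*ny*nz]) \
        copy(next[0:nx*ny*nz])
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                int id = (k * ny + j) * nx + i;
                /* 7-point Laplacian, unit grid spacing for brevity */
                float lap = curr[id - 1] + curr[id + 1]
                          + curr[id - nx] + curr[id + nx]
                          + curr[id - nx * ny] + curr[id + nx * ny]
                          - 6.0f * curr[id];
                next[id] = 2.0f * curr[id] - prev[id] + v2[id] * lap;
            }
}
```

Removing the pragma leaves a valid single-core C program, which is the portability argument the abstract makes for OpenACC over CUDA or OpenCL.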

Analysis of Impact of Correlation Between Hardware Configuration and Branch Handling Methods Executing General Purpose Applications (범용 응용프로그램 실행 시 하드웨어 구성과 분기 처리 기법에 따른 GPU 성능 분석)

  • Choi, Hong Jun;Kim, Cheol Hong
    • The Journal of the Korea Contents Association / v.13 no.3 / pp.9-21 / 2013
  • Due to the increased computing power and flexibility of GPUs, recent GPUs execute general-purpose parallel applications as well as graphics applications. Programmers can use GPGPU through the APIs provided by GPU vendors. Unfortunately, the computational resources of a GPU are not fully utilized when executing general-purpose applications because of frequent branch instructions. To handle the branch problem, several warp formations have been proposed. Intuitively, we would expect the warp formations providing higher computational resource utilization to show higher performance. Contrary to this expectation, our simulation results show that the performance of a warp formation providing better utilization can be lower than that of one providing worse utilization. This is because a warp formation providing high utilization causes a serious memory bottleneck due to the increased number of memory requests. Therefore, a warp formation providing high computational utilization cannot guarantee high performance without adequate hardware resources. For this reason, we analyze the correlation between hardware configuration and warp formation. Our simulation results provide a guideline for solving the underutilization problem caused by branch instructions when designing modern GPUs.
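For readers unfamiliar with the branch problem this abstract targets: threads in a warp execute in lockstep, so a data-dependent branch forces the warp to run both paths serially while inactive lanes idle. A minimal illustrative CUDA kernel (ours, not from the paper):

```cuda
// Illustrative CUDA kernel (not from the paper): threads in one warp
// take different branches, so the warp executes both paths serially
// and SIMD lanes sit idle, which is the underutilization above.
__global__ void divergent(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0)        // even/odd split inside each warp:
        out[i] = in[i] * 2;    // half the lanes run this path...
    else
        out[i] = in[i] + 1;    // ...while the other half wait, then swap.
}
```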

An Analytical Model for Performance Prediction of AES on GPU Architecture (GPU 아키텍처의 AES 암호화 성능 예측 분석 모델)

  • Kim, Kyuwoon;Kim, Hyunwoo;Kim, Huijeong;Huh, Taeyoung;Jung, Sanghyuk;Song, Yong Ho
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.4 / pp.89-96 / 2013
  • The graphics processing unit (GPU) has been developed to process not only graphics data but also general system data. It shows better performance than a CPU on 3D graphics algorithms and parallel programs. In order to run a CPU algorithm on a GPU, we should understand GPU architectures and rewrite the program considering the parallel processing capability and the new memory model of the GPU. For these reasons, a model that predicts the performance of an algorithm on a GPU system is required: it can flag problems early in GPU application development and serve as a performance evaluation standard for GPUs. In this paper, we applied the AES encryption algorithm to our performance model and achieved performance prediction with high accuracy under a heavy workload.
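The abstract does not reproduce the model itself; as a rough indication of the shape such analytical GPU models usually take, kernel time is often bounded by the slower of the compute and memory pipelines, for example:

```latex
% Generic roofline-style estimate (our illustration, not the paper's model):
% N_inst = dynamic instruction count, B_mem = bytes moved to/from DRAM.
T_{\mathrm{kernel}} \approx \max\left(
  \frac{N_{\mathrm{inst}}}{\mathrm{IPC}_{\mathrm{peak}} \cdot f_{\mathrm{core}}},\;
  \frac{B_{\mathrm{mem}}}{\mathrm{BW}_{\mathrm{peak}}}
\right)
```

For a memory-light kernel like AES round computation over registers and shared memory, the first (compute) term typically dominates.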

EFFICIENT COMPUTATION OF COMPRESSIBLE FLOW BY HIGHER-ORDER METHOD ACCELERATED USING GPU (고차 정확도 수치기법의 GPU 계산을 통한 효율적인 압축성 유동 해석)

  • Chang, T.K.;Park, J.S.;Kim, C.
    • Journal of Computational Fluids Engineering / v.19 no.3 / pp.52-61 / 2014
  • The present paper deals with the efficient computation of higher-order CFD methods for compressible flow using graphics processing units (GPUs). Higher-order CFD methods, such as discontinuous Galerkin (DG) methods and the correction procedure via reconstruction (CPR), can achieve arbitrarily high-order accuracy with a compact stencil on unstructured meshes. However, they require much higher computational cost than the widely used finite volume methods (FVM). A graphics processing unit, consisting of hundreds or thousands of small cores, is well suited to the massively parallel computation of compressible flow with higher-order CFD methods and can greatly reduce computation time. A higher-order multi-dimensional limiting process (MLP) is applied for robust control of numerical oscillations around shock discontinuities and is implemented efficiently on the GPU. The program is written and optimized using the CUDA library offered by NVIDIA. The whole algorithm is implemented to guarantee accurate and efficient computation under the shared-memory parallel programming model of the GPU. Extensive numerical experiments validate that the GPU successfully accelerates the computation of compressible flow with higher-order methods.
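The compact stencil is what makes these methods GPU-friendly: each element's update depends only on its own polynomial data plus face neighbors, so elements map naturally onto threads. A hedged CUDA sketch of that mapping (names, data layout, and the omitted face terms are our assumptions, not the paper's code):

```cuda
// Hypothetical sketch: one thread per element accumulates an
// element-local DG/CPR derivative contribution. A real solver would
// also add face-flux terms from neighbors; omitted for brevity.
__global__ void dg_residual(const float *u,    // [nElem * nSol] solution DOFs
                            const float *dmat, // [nSol * nSol] derivative matrix
                            float *res,        // [nElem * nSol] residual out
                            int nElem, int nSol)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nElem) return;
    for (int i = 0; i < nSol; i++) {
        float acc = 0.0f;
        for (int j = 0; j < nSol; j++)   // compact, element-local stencil
            acc += dmat[i * nSol + j] * u[e * nSol + j];
        res[e * nSol + i] = acc;
    }
}
```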

Design of Virtual Machine for Vertex Shader (정점 셰이더의 가상 기계 구현)

  • Ha, Chang-Soo;Kim, Ju-Hong;Choi, Byeong-Yoon
    • Proceedings of the IEEK Conference / 2005.11a / pp.1003-1006 / 2005
  • The vertex shader of a PC GPU has advanced to the point of replacing half of the traditional fixed transform-and-lighting (T&L) pipeline, and the memory available for storing the resources needed to process instructions is effectively unlimited. A programmable GPU is needed for mobile systems as well as personal computers. In this paper, we implement a software virtual machine for the vertex shader in C++. Our goal is to design a hardware GPU that can be applied to mobile systems. The virtual machine executes nVidia GPU instructions. Input data for the virtual machine are generated by the Microsoft fxc compiler; that is, the input is a compiled shader program written in HLSL, Cg, or ASM. The virtual machine will serve as a reference model for designing a hardware GPU and can be used as a testbed for added or modified instructions.
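The core of such a shader virtual machine is a fetch-decode-execute loop over a register file of 4-component vectors. A hypothetical sketch in that spirit (the opcodes, encoding, and register count are illustrative, not the actual nVidia instruction set):

```cuda
// Hypothetical vertex-shader VM core (host-side C++): a tiny
// interpreter over 4-wide vector registers. Real shader ISAs add
// swizzles, constants, and many more opcodes.
#include <array>
#include <cstdint>
#include <vector>

struct Vec4 { float x, y, z, w; };
enum class Op : uint8_t { MOV, ADD, MUL, END };
struct Inst { Op op; uint8_t dst, src0, src1; };

void run(const std::vector<Inst>& prog, std::array<Vec4, 16>& r)
{
    for (const Inst& in : prog) {
        Vec4& d = r[in.dst];
        const Vec4& a = r[in.src0];
        const Vec4& b = r[in.src1];
        switch (in.op) {
        case Op::MOV: d = a; break;
        case Op::ADD: d = {a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w}; break;
        case Op::MUL: d = {a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w}; break;
        case Op::END: return;   // end of shader program
        }
    }
}
```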


Analyzing problem of job failures due to low GPU memory when concurrent running inference jobs in a container environment (컨테이너 환경에서 추론 작업 동시 실행 시 GPU 메모리 부족으로 인한 작업 실패 문제 분석)

  • HyungJun Kim;Jihun Kang
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.71-74 / 2023
  • Unlike training, which requires large-scale computational resources, AI inference jobs can be executed concurrently on a single server; because their execution times are relatively short, they occupy computing resources only briefly and return them quickly after completion, which makes it easy to operate many inference jobs at the same time. However, the computing resources of a single server are limited. Jobs must therefore be operated within the server's capacity, and when inference jobs exceeding that capacity run concurrently, contention arises from the resource shortage. In this paper, we experimentally confirm the job-failure problem caused by insufficient GPU memory when multiple inference jobs run concurrently in a container environment. We also analyze the GPU resource contention among concurrent inference jobs and the degradation of resource utilization caused by the GPU memory wasted by failed inference jobs.
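One practical mitigation this analysis suggests is checking free device memory before admitting a job. A minimal sketch using the CUDA runtime (the threshold and admission policy are our assumptions, not a mechanism from the paper):

```cuda
// Hypothetical admission check: query free GPU memory before launching
// an inference job, so a job that cannot fit is rejected up front
// instead of failing mid-run and wasting the memory it claimed.
#include <cstdio>
#include <cuda_runtime.h>

bool can_admit(size_t required_bytes)
{
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess)
        return false;                    // treat query failure as "no"
    printf("GPU memory: %zu free / %zu total\n", free_b, total_b);
    return free_b >= required_bytes;     // admit only if the job fits
}
```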

MSHR-Aware Dynamic Warp Scheduler for High Performance GPUs (GPU 성능 향상을 위한 MSHR 활용률 기반 동적 워프 스케줄러)

  • Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
    • KIPS Transactions on Computer and Communication Systems / v.8 no.5 / pp.111-118 / 2019
  • Recent graphics processing units (GPUs) provide high throughput by using powerful hardware resources. However, massive memory accesses cause GPU performance degradation due to cache inefficiency. Therefore, GPU performance can be improved by reducing thread-level parallelism when the cache suffers from memory contention. In this paper, we propose a dynamic warp scheduler which controls thread-level parallelism according to the degree of cache contention. In general, the greedy-then-oldest (GTO) warp issue policy shows lower parallelism than the loose round-robin (LRR) policy. Therefore, the proposed warp scheduler employs the LRR scheduling policy when Miss Status Holding Register (MSHR) utilization is low. On the other hand, the GTO policy is employed to reduce thread-level parallelism when MSHR utilization is high. Our proposed technique shows better performance than the LRR and GTO policies since it selects the more efficient scheduling policy dynamically. According to our experimental results, the proposed technique improves IPC by 12.8% and 3.5% on average over LRR and GTO, respectively.
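The decision rule, as we read it from the abstract, is a threshold switch on MSHR occupancy. A sketch of that logic (the 0.75 threshold and the interface are hypothetical; real hardware would evaluate this per cycle in the issue stage):

```cuda
// Hypothetical sketch of the MSHR-based policy switch described above.
// The threshold is an assumption, not a value from the paper.
enum class WarpPolicy { LRR, GTO };

WarpPolicy select_policy(unsigned mshr_in_use, unsigned mshr_total)
{
    double utilization = (double)mshr_in_use / (double)mshr_total;
    // Low MSHR pressure: caches are coping, keep parallelism high (LRR).
    // High MSHR pressure: memory contention, throttle parallelism (GTO).
    return (utilization < 0.75) ? WarpPolicy::LRR : WarpPolicy::GTO;
}
```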

Speed-optimized Implementation of HIGHT Block Cipher Algorithm (HIGHT 블록 암호 알고리즘의 고속화 구현)

  • Baek, Eun-Tae;Lee, Mun-Kyu
    • Journal of the Korea Institute of Information Security & Cryptology / v.22 no.3 / pp.495-504 / 2012
  • This paper presents various speed optimization techniques for software implementations of the HIGHT block cipher on CPUs and GPUs. We considered 32-bit and 64-bit operating systems for the CPU implementations. After we applied bit-slicing and byte-slicing techniques to HIGHT, the encryption speed reached 1.48 Gbps on an Intel Core i7 920 CPU with a 64-bit operating system, which is up to 2.4 times faster than the previous implementation. We also implemented HIGHT on an NVIDIA GPU with CUDA and applied various optimization techniques, such as storing the most frequently used data, like the subkeys and the F lookup table, in shared memory, and using coalesced accesses when reading data from global memory. To our knowledge, this is the first result that implements and optimizes HIGHT on a GPU. We verified that the byte-slicing technique guarantees a speed-up of more than 20%, resulting in a speed 31 times faster than that on a CPU.
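A hedged CUDA sketch of the two GPU optimizations named above: staging round subkeys in fast on-chip shared memory, and coalesced global loads (consecutive threads reading consecutive words). The round function itself is elided; this is not the authors' code:

```cuda
#include <cstdint>

// HIGHT uses 8 whitening keys plus 128 round subkeys, one byte each.
#define NUM_SUBKEYS 136

__global__ void hight_encrypt(const uint8_t *subkeys,
                              const uint64_t *in, uint64_t *out, int nBlocks)
{
    __shared__ uint8_t sk[NUM_SUBKEYS];
    // Cooperative copy of the subkeys into shared memory, so the 32
    // rounds below read them from on-chip storage, not DRAM.
    for (int i = threadIdx.x; i < NUM_SUBKEYS; i += blockDim.x)
        sk[i] = subkeys[i];
    __syncthreads();

    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nBlocks) return;
    uint64_t block = in[b];   // coalesced: thread b reads 64-bit word b
    // ... 32 HIGHT rounds using sk[] would go here ...
    out[b] = block;           // coalesced write back
}
```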

Parallel Range Query Processing with R-tree on Multi-GPUs (다중 GPU를 이용한 R-tree의 병렬 범위 질의 처리 기법)

  • Ryu, Hongsu;Kim, Mincheol;Choi, Wonik
    • Journal of KIISE / v.42 no.4 / pp.522-529 / 2015
  • Ever since the R-tree was proposed to index multi-dimensional data, many efforts have been made to improve its query performance. One common trend is to parallelize query processing using multi-core architectures. To this end, a GPU-based R-tree was recently proposed. However, even though a GPU-based R-tree improves query performance, its ability to handle large volumes of data is limited because GPUs have limited physical memory. To address this problem, we propose the MGR-tree (Multi-GPU R-tree), which can manage large volumes of data by distributing nodes across multiple GPUs. Our experiments show that the MGR-tree is up to 9.1 times faster than a sequential search on a GPU and up to 1.6 times faster than a conventional GPU-based R-tree.
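The enabling idea is that the combined device memory of several GPUs can hold an index one GPU cannot. A hypothetical CUDA sketch of partitioning the node array across devices (the even byte split and buffer layout are our illustration, not the MGR-tree's actual node placement):

```cuda
#include <cuda_runtime.h>

// Upload one contiguous share of the R-tree node array to each GPU.
void upload_partitions(const char *nodes, size_t total_bytes,
                       void *dev_ptrs[], int nGPUs)
{
    size_t chunk = (total_bytes + nGPUs - 1) / nGPUs;
    for (int g = 0; g < nGPUs; g++) {
        size_t off = (size_t)g * chunk;
        size_t len = (off + chunk <= total_bytes) ? chunk
                                                  : total_bytes - off;
        cudaSetDevice(g);                    // make device g current
        cudaMalloc(&dev_ptrs[g], len);       // its share of the nodes
        cudaMemcpy(dev_ptrs[g], nodes + off, len, cudaMemcpyHostToDevice);
        // A range-query kernel would then run on each device and
        // partial results would be merged on the host.
    }
}
```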

Study of Cache Performance on GPGPU

  • Choi, Kyu Hyun;Kim, Seon Wook
    • IEIE Transactions on Smart Processing and Computing / v.4 no.2 / pp.78-82 / 2015
  • General-purpose graphics processing units (GPGPUs) provide tremendous computational and processing power. Despite the latency-hiding mechanism, a GPU architecture requires high memory bandwidth and low latency between the computational units and the memory system. For this reason, current GPU architectures have a private L1 cache in each core and a shared L2 cache to increase performance by reducing memory latency. But in some cases, this CPU-like cache design is not suitable for GPGPUs. In this paper, we analyze detailed cache performance in relation to GPGPU application characteristics, and suggest technical alternatives for the GPGPU architecture as future work.
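On NVIDIA hardware of that generation, the L1 cache and shared memory share one on-chip SRAM, and the split is software-visible, so applications can bias it toward whichever the workload favors. A small example using the real CUDA runtime call (the kernel name is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { /* placeholder kernel */ }

void prefer_l1()
{
    // Request the larger L1 partition for this kernel, useful for
    // cache-sensitive workloads like those the paper's analysis targets.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
}
```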