Search | Korea Science

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
- Journal of KIISE:Computing Practices and Letters / v.16 no.7 / pp.775-787 / 2010
Unlike general-purpose CPUs, GPUs have been specialized as many-core streaming processors, and they are replacing CPUs in a growing range of computations thanks to their outstanding parallel computing capacity. In response to this trend, NVIDIA recently released a parallel computing architecture called CUDA (Compute Unified Device Architecture), which offers a flexible GPU programming environment for GPGPU (General Purpose GPU) computing. To produce efficient parallel software with the CUDA API, programmers generally need a clear understanding of many aspects of the GPU's computing architecture. In this article, we explain several optimization techniques for CUDA programming that we have verified through extensive experimentation and trial and error, and review how those techniques affect execution performance. In particular, we use a specific problem as an example to analyze several factors that affect performance, such as effective access to the hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that may be used effectively in CUDA-based parallel programming.

Parallel Computation for Extended Edit Distances Using the Shared Memory on GPU (GPU의 공유메모리를 활용한 확장편집거리 병렬계산)

Kim, Youngho;Na, Joong Chae;Sim, Jeong Seop
- KIPS Transactions on Computer and Communication Systems / v.4 no.7 / pp.213-218 / 2015
Given two strings X and Y (|X|=m, |Y|=n) over an alphabet $\Sigma$, the extended edit distance between X and Y can be computed using dynamic programming in O(mn) time and space. Recently, a parallel algorithm was presented that computes the extended edit distance between X and Y in O(m+n) time and O(mn) space using m threads. In this paper, we present an improved parallel algorithm that uses the shared memory on a GPU. The experimental results show that our parallel algorithm runs about 19 to 25 times faster than the previous parallel algorithm.
https://doi.org/10.3745/KTCCS.2015.4.7.213
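
The O(m+n) parallel bound quoted in this abstract follows from a general property of edit-distance tables: cells on the same anti-diagonal do not depend on one another. The sketch below illustrates that ordering in plain Python. It is only an illustration under stated assumptions: it computes the standard Levenshtein distance rather than the paper's extended edit distance, it runs sequentially, and the function name `edit_distance_wavefront` is hypothetical.

```python
def edit_distance_wavefront(x: str, y: str) -> int:
    """Levenshtein distance, filled one anti-diagonal at a time.

    Assumption: standard insert/delete/substitute costs of 1, not the
    extended edit distance studied in the paper.
    """
    m, n = len(x), len(y)
    # d[i][j] is the distance between the prefixes x[:i] and y[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    # Cells with i + j == k read only diagonals k-1 and k-2, so on a GPU
    # each cell of one diagonal could be handled by a separate thread.
    for k in range(2, m + n + 1):
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]
```

The outer loop runs m+n-1 times, which is where the linear number of parallel steps comes from; the inner loop is the part a GPU implementation would distribute across threads.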

Correct Implementation of Sub-warp Parallel Prefix Operations based on GPU Hardware Architecture (GPU 하드웨어 아키텍처 기반 sub-warp 단위 병렬 프리픽스(prefix) 연산의 정확한 구현)

Park, Taejung
- Journal of Digital Contents Society / v.18 no.3 / pp.613-619 / 2017
This paper presents CUDA (Compute Unified Device Architecture) code that produces correct GPU parallel segmented prefix operation results for large data arrays when the segment length is less than 32. Mark Harris and Michael Garland published CUDA code to address this task. This paper shows that their code does not generate correct results when the local segment length is less than 32, discusses the cause of the problem, and presents CUDA code that generates correct results. The segmented parallel prefix operation presented in this paper can be applied as a building block in various large-scale parallel processing algorithms, including k-nearest neighbor search.
https://doi.org/10.9728/dcs.2017.18.3.613
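
For readers unfamiliar with the operation, a segmented prefix operation restarts its running result at every segment boundary. The following is a minimal sequential reference, assuming fixed-length segments and addition as the operator; the paper's own operator and segment representation may differ, and the name `segmented_inclusive_scan` is hypothetical. A reference like this is typically used as a CPU-side oracle to check the output of a GPU kernel.

```python
def segmented_inclusive_scan(values, segment_length):
    """Inclusive prefix sum that restarts at every segment boundary.

    Assumptions: all segments share one fixed length (the last may be
    shorter) and the combining operator is addition.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    out = []
    running = 0
    for index, value in enumerate(values):
        if index % segment_length == 0:
            running = 0  # a new segment starts here
        running += value
        out.append(running)
    return out
```

For example, `segmented_inclusive_scan([1, 2, 3, 4, 5, 6, 7], 3)` scans the segments `[1, 2, 3]`, `[4, 5, 6]`, and `[7]` independently.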

Improving the Performance of Document Similarity by using GPU Parallelism (GPU 병렬성을 이용한 문서 유사도 계산 성능 개선)

Park, Il-Nam;Bae, Byung-Gurl;Im, Eun-Jin;Kang, Seung-Shik
- The KIPS Transactions:PartB / v.19B no.4 / pp.243-248 / 2012
In information retrieval systems, such as vector model implementations and document clustering, document similarity calculation accounts for a large part of overall system performance. In this paper, GPU parallelism is explored to speed up document similarity calculation in a CUDA framework. The proposed method makes similarity calculation almost 15 times faster than a typical CPU-based framework, and 5.2 and 3.4 times faster than methods using CUBLAS and Thrust, respectively.
https://doi.org/10.3745/KIPSTB.2012.19B.4.243
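
The abstract does not state which similarity measure is used. As a point of reference, the sketch below shows cosine similarity over raw term-frequency vectors, a common choice in vector-model retrieval; treating it as the paper's measure is an assumption, and the function name and whitespace tokenization are hypothetical. The dot product at its core is the part that maps naturally onto GPU threads when many document pairs are compared.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents' term-frequency vectors.

    Assumptions: lowercase whitespace tokenization and raw term counts;
    a real system would likely apply weighting such as TF-IDF.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Counter returns 0 for terms that do not occur in the other document.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```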

A Parallel Algorithm for Measuring Graph Similarity Using CUDA on GPU (GPU에서 CUDA를 이용한 그래프 유사도 측정을 위한 병렬 알고리즘)

Son, Min-Young;Kim, Young-Hak;Choi, Sung-Ja
- KIISE Transactions on Computing Practices / v.23 no.3 / pp.156-164 / 2017
Measuring the similarity of two graphs is a basic tool for solving graph problems in various applications. Most graph algorithms have a high time complexity with respect to the number of vertices and edges. Because Graphics Processing Units (GPUs) have high computational power and can be obtained at a low cost, they have been widely used in graph applications to improve execution time. This paper proposes an efficient parallel algorithm to measure graph similarity using CUDA in a GPU environment. The experimental results show that the proposed approach brings a considerable improvement in performance and efficiency when compared to CPU-based results. Our results also show that the performance improves significantly as the size of the graph increases.
https://doi.org/10.5626/KTCP.2017.23.3.156

Multiview Stereo Matching on Mobile Devices Using Parallel Processing on Embedded GPU (임베디드 GPU에서의 병렬처리를 이용한 모바일 기기에서의 다중뷰 스테레오 정합)

Jeon, Yun Bae;Park, In Kyu
- Journal of Broadcast Engineering / v.24 no.6 / pp.1064-1071 / 2019
Multiview stereo matching algorithms are used to reconstruct 3D shape from a set of 2D images. Conventional multiview stereo algorithms have been implemented on high-performance hardware because each step involves a large number of calculations. However, as the performance of mobile graphics processors has increased rapidly in recent years, complex computer vision algorithms can now be implemented on mobile devices such as smartphones and embedded boards. In this paper, we parallelize a multiview stereo algorithm using OpenCL on a mobile GPU and present various optimization techniques for embedded hardware with limited resources.
https://doi.org/10.5909/JBE.2019.24.6.1064

MSHR-Aware Dynamic Warp Scheduler for High Performance GPUs (GPU 성능 향상을 위한 MSHR 활용률 기반 동적 워프 스케줄러)

Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
- KIPS Transactions on Computer and Communication Systems / v.8 no.5 / pp.111-118 / 2019
Recent graphics processing units (GPUs) provide high throughput by using powerful hardware resources. However, massive memory accesses degrade GPU performance because of cache inefficiency. GPU performance can therefore be improved by reducing thread parallelism when the cache suffers from memory contention. In this paper, we propose a dynamic warp scheduler that controls thread parallelism according to the degree of cache contention. The greedy-then-oldest (GTO) warp issue policy usually exhibits lower parallelism than the loose round-robin (LRR) policy. The proposed warp scheduler therefore employs the LRR policy when Miss Status Holding Register (MSHR) utilization is low, and employs the GTO policy to reduce thread parallelism when MSHR utilization is high. Because it selects the more efficient scheduling policy dynamically, the proposed technique outperforms both the LRR and GTO policies. According to our experimental results, it improves IPC by 12.8% and 3.5% on average over LRR and GTO, respectively.
https://doi.org/10.3745/KTCCS.2019.8.5.111
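
The policy switch described in this abstract can be pictured as a simple threshold rule on MSHR utilization. The sketch below is a toy model, not the paper's hardware design: the function name, the way utilization is computed, and the 0.5 default threshold are all hypothetical, since the abstract does not give the actual switching criterion.

```python
def select_warp_policy(mshr_in_use: int, mshr_total: int,
                       threshold: float = 0.5) -> str:
    """Pick a warp scheduling policy from current MSHR utilization.

    Assumption: a single fixed threshold (hypothetical value 0.5)
    separates "low" from "high" utilization.
    """
    if mshr_total <= 0:
        raise ValueError("mshr_total must be positive")
    utilization = mshr_in_use / mshr_total
    # High utilization signals cache contention, so fall back to GTO,
    # which exposes less thread parallelism than LRR.
    return "GTO" if utilization >= threshold else "LRR"
```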

Fast Depth Map Estimation using Parallel Processing based on GPU (GPU기반 Depth Map 회득을 위한 고속 병렬처리 기법)

Jin, Moon-Sub;Choi, Ji-Yoon;Choo, Hyon-Gon;Kim, Jin-Woong;Park, Jong-Il
- Proceedings of the Korean Society of Broadcast Engineers Conference / 2011.07a / pp.396-398 / 2011
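This paper proposes a GPU-based parallel processing technique for real-time operation of a module that computes a depth map using a Pro-cam system, which consists of two cameras and one projector: the projected pattern image is captured by the cameras, and the depth map is computed from the captured images. To interpret the structured-light pattern in the input images and compute the depth map, the dynamic pattern decoding step repeatedly compares the projector's pattern image with the captured camera pattern images, so we accelerated this step through parallel processing using GPU programming. As a result, we achieved a speedup of about 18 times over the previous CPU-based implementation.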

Time Measurement on GPU-based LCTM Simulation (GPU 기반 LCTM 교통 시뮬레이션에서의 성능 측정)

KYUNG, MinGi;Shin, In-soo;Cho, Min-Kyu;Min, Dugki
- Proceedings of the Korea Information Processing Society Conference / 2019.10a / pp.141-143 / 2019
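In this study, we implemented the LCTM (Lane Cell Transmission Model), a mesoscopic traffic simulation model, as a GPU-based parallel traffic simulation and measured the simulation run time. This paper discusses the considerations for parallelizing the LCTM traffic simulation, then analyzes and measures the factors that affect performance when implementing a parallel traffic simulation on a GPU.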
https://doi.org/10.3745/PKIPS.y2019m10a.141

Parallel Self-Collision Detection for Large 3D Mesh Model using GPU (GPU를 이용한 대용량 3D 메쉬 모델에 대한 병렬 자체 충돌검사)

Park, Sung-Hun;Kim, Yangen;Choi, Yoo-Joo
- Proceedings of the Korea Information Processing Society Conference / 2022.05a / pp.708-711 / 2022
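This paper proposes a GPU-based parallel self-collision detection method for large 3D mesh models, with the aim of raising the success rate of 3D printing. For robust self-collision detection, we propose a procedure consisting of a separating-axis test, a triangle-triangle intersection test, a mesh connectivity test, and a partitioned processing technique for large meshes. To perform this self-collision detection quickly, we present a GPU-based parallel implementation method.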
https://doi.org/10.3745/PKIPS.y2022m05a.708
