Optimization of Warp-wide CUDA Implementation for Parallel Shifted Sort Algorithm

Park, Taejung;

doi:10.9728/dcs.2017.18.4.739

Journal of Digital Contents Society (디지털콘텐츠학회 논문지)

Volume 18, Issue 4, Pages 739-745, 2017. pISSN 1598-2009, eISSN 2287-738X

Digital Contents Society (한국디지털콘텐츠학회)



병렬 Shifted Sort 알고리즘의 Warp 단위 CUDA 구현 최적화

Park, Taejung (Department of Digital Media, Duksung Women's University)

박태정 (덕성여자대학교 디지털미디어학과)

Received : 2017.07.11
Accepted : 2017.07.28
Published : 2017.07.31

https://doi.org/10.9728/dcs.2017.18.4.739


Abstract

This paper presents and discusses an implementation of the GPU shifted sort method for finding approximate k nearest neighbors that executes entirely within a warp, the minimum execution unit in the GPU parallel architecture. It also compares the results with two other common nearest neighbor search methods: a GPU-based kd-tree and the ANN (Approximate Nearest Neighbor) library. Because applications very commonly need only small values of k, the proposed implementation focuses on k = 2, 4, 8, and 16, which can be handled efficiently within a warp. The paper further discusses ways to optimize the implementation by improving memory management in a loop that uses the open CUB library and by adopting CUDA instructions that are supported directly by GPU hardware. In the tests, the proposed implementation is more than 16 times faster than the other GPU-based methods, and the results suggest that the speed-up would grow for larger input data.

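This paper discusses how to implement shifted-sort-based k-nearest-neighbor search within a warp, the smallest physical unit of thread execution in the GPU parallel hardware architecture, and presents a comparison with the GPU-based kd-tree and the CPU-based ANN library, both of which are widely used for the same purpose. Considering that many applications need only relatively small values of k, the work concentrates on optimizing the cases k = 2, 4, 8, and 16, which can be processed directly inside a warp. The implementation details cover optimizations such as memory management inside loops for the open CUB library and the use of instructions executed directly by GPU hardware. In the experiments, the proposed method ran more than 16 times faster than existing comparable GPU-based methods when the numbers of data points and query points were each $2^{23}$, and this advantage is expected to grow as the size of the data to be processed increases.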
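
The abstract names the shifted sort method without describing its steps. As a rough illustration only, the following CPU-side Python sketch assumes the commonly described scheme: sort the points by Morton (Z-order) code under several shifts of the coordinates, treat the k points on each side of a point in sorted order as candidates, and keep the k closest candidates. It is not the warp-level CUDA implementation discussed in the paper. The names `morton2d`, `dist2`, and `approx_knn`, the 2D integer grid, the three shifts, and the self-query setup (each point searches among the other points) are all assumptions made for this example.

```python
def morton2d(x, y, bits):
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def dist2(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def approx_knn(points, k, bits=10):
    """Approximate k nearest neighbors of every point (illustrative sketch).

    Assumes integer coordinates in [0, 2**bits) and len(points) > k.
    Returns, for each point index, a list of k neighbor indices.
    """
    n = len(points)
    size = 1 << bits
    candidates = [set() for _ in range(n)]

    # Assumption: d + 1 = 3 shifts for d = 2, spaced by one third of the grid.
    for j in range(3):
        s = (j * size) // 3
        # Shifted coordinates can exceed 2**bits, so use one extra key bit.
        order = sorted(
            range(n),
            key=lambda i: morton2d(points[i][0] + s, points[i][1] + s, bits + 1),
        )
        # Candidates: up to k points before and k points after in sorted order.
        for pos, i in enumerate(order):
            lo = max(0, pos - k)
            hi = min(n, pos + k + 1)
            candidates[i].update(order[lo:hi])

    result = []
    for i in range(n):
        candidates[i].discard(i)  # a point is not its own neighbor
        ranked = sorted(candidates[i], key=lambda c: (dist2(points[i], points[c]), c))
        result.append(ranked[:k])
    return result
```

Calling `approx_knn(points, 4)` on a list of integer `(x, y)` tuples returns four neighbor indices per point. Because only sorted-order neighbors are examined, the result is approximate in general; it coincides with the exact answer when the candidate windows happen to cover the true neighbors.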


