[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.9728/dcs.2017.18.4.739

Optimization of Warp-wide CUDA Implementation for Parallel Shifted Sort Algorithm

Park, Taejung (Department of Digital Media, Duksung Women's University)

Publication Information

Journal of Digital Contents Society / v.18, no.4, 2017, pp. 739-745

Abstract

This paper presents and discusses a GPU implementation of the shifted sorting method for finding approximate k nearest neighbors that executes within a "warp", the minimum execution unit of the GPU parallel architecture. The paper also compares the results with two other common nearest neighbor search methods: a GPU-based kd-tree and the ANN (Approximate Nearest Neighbor) library. The proposed implementation focuses on small values of k, i.e. 2, 4, 8, and 16, which are handled efficiently within a warp, since applications commonly use small k. The paper also discusses ways to optimize the implementation by improving memory management inside a loop that uses the CUB open library and by adopting CUDA instructions that are supported directly by the GPU hardware. In the tests, the proposed implementation achieves more than a 16-fold speed-up over other GPU-based methods, suggesting that the improvement would be even greater for larger input data.
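To make the shifted sorting idea concrete, the following is a minimal sequential C++ sketch, not the author's warp-wide CUDA implementation. It assumes 3D points in [0, 1)^3, quantizes each shifted coordinate to a 10-bit grid, interleaves the bits into a 30-bit Morton (space-filling curve) code, sorts by that code, and takes the k points on each side of a point in the sorted order as neighbor candidates. The use of d + 1 = 4 diagonal shifts of j/4, the window of k positions on each side, and all function names (`shiftedSortKnn`, `insertCandidate`, `quantize`, `morton3D`) are illustrative assumptions. In a GPU version the sort would typically be done with a library radix sort such as CUB's, and the per-point candidate merge is the step the paper maps onto a warp; this sketch does not model either.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct Point { float x, y, z; };  // assumed to lie in [0, 1)^3

// Spread the low 10 bits of v so that two zero bits separate consecutive bits.
static uint32_t expandBits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Interleave three 10-bit integers into a 30-bit Morton (Z-order) code.
static uint32_t morton3D(uint32_t x, uint32_t y, uint32_t z) {
    return (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
}

// Map a shifted coordinate in [0, 2) onto the 10-bit grid 0..1023 (clamped).
static uint32_t quantize(float v) {
    float t = std::min(std::max(v * 512.0f, 0.0f), 1023.0f);
    return static_cast<uint32_t>(t);
}

static float dist2(const Point& a, const Point& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Insert (d2, idx) into a list kept sorted by distance, capped at k entries.
static void insertCandidate(std::vector<std::pair<float, int>>& cands,
                            float d2, int idx, int k) {
    for (const auto& c : cands)
        if (c.second == idx) return;  // already found under an earlier shift
    auto pos = std::lower_bound(cands.begin(), cands.end(), std::make_pair(d2, idx));
    cands.insert(pos, std::make_pair(d2, idx));
    if (static_cast<int>(cands.size()) > k) cands.pop_back();  // drop the farthest
}

// Approximate k nearest neighbors by shifted Morton-order sorting.
// Returns, for each point, up to k neighbor indices ordered by distance.
std::vector<std::vector<int>> shiftedSortKnn(const std::vector<Point>& pts, int k) {
    assert(k > 0);
    const int n = static_cast<int>(pts.size());
    const int numShifts = 4;  // assumption: d + 1 diagonal shifts for d = 3
    std::vector<std::vector<std::pair<float, int>>> best(n);
    std::vector<uint32_t> code(n);
    std::vector<int> order(n);

    for (int j = 0; j < numShifts; ++j) {
        const float s = static_cast<float>(j) / numShifts;  // shift every axis by s
        for (int i = 0; i < n; ++i) {
            code[i] = morton3D(quantize(pts[i].x + s), quantize(pts[i].y + s),
                               quantize(pts[i].z + s));
            order[i] = i;
        }
        // Sort point indices by Morton code; ties are broken by index.
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            return code[a] != code[b] ? code[a] < code[b] : a < b;
        });
        // Candidates: the k points on each side of point i in the sorted order.
        for (int r = 0; r < n; ++r) {
            const int i = order[r];
            for (int off = -k; off <= k; ++off) {
                const int rr = r + off;
                if (off == 0 || rr < 0 || rr >= n) continue;
                insertCandidate(best[i], dist2(pts[i], pts[order[rr]]), order[rr], k);
            }
        }
    }

    std::vector<std::vector<int>> result(n);
    for (int i = 0; i < n; ++i)
        for (const auto& c : best[i]) result[i].push_back(c.second);
    return result;
}
```

When n ≤ k + 1, every candidate window spans the whole point set, so the result is exact; on larger inputs it is only approximate, and adding shifts or widening the window trades running time for accuracy.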

Keywords

CUDA; GPGPU; Space-filling Curve; Shifted Sort; k Approximate Nearest Neighbor Search

Citations & Related Records

Times Cited By KSCI: 2
