Optimization of Warp-wide CUDA Implementation for Parallel Shifted Sort Algorithm

  • Park, Taejung (박태정) (Department of Digital Media, Duksung Women's University)
  • Received : 2017.07.11
  • Accepted : 2017.07.28
  • Published : 2017.07.31

Abstract

This paper presents and discusses an implementation of the GPU shifted sort method for finding approximate k nearest neighbors that executes entirely within a warp, the minimum physical thread execution unit in the GPU parallel architecture. It also compares the implementation against two widely used nearest neighbor search methods: a GPU-based kd-tree and the CPU-based ANN (Approximate Nearest Neighbor) library. Because many applications need only small values of k, the proposed implementation focuses on k = 2, 4, 8, and 16, which can be handled directly within a warp. The paper further discusses implementation-level optimizations: improving memory management inside a loop that uses the CUB open-source library, and adopting CUDA instructions that are directly supported by GPU hardware. In the tests, with $2^{23}$ data points and $2^{23}$ query points, the proposed implementation runs more than 16 times faster than the other GPU-based methods, and this advantage is expected to grow as the input data becomes larger.
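The general shifted sort idea can be illustrated on the CPU with a minimal C++ sketch. This is not the paper's warp-wide CUDA implementation. It assumes 2D points with integer coordinates below $2^{15}$, and all names (`Point`, `morton2d`, `approx_knn`) and the shift values are hypothetical. For each shift, the points are sorted by the Morton (Z-order) code of their shifted coordinates; each point then takes its k predecessors and k successors in sorted order as candidates and keeps the k closest found so far.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Point { uint32_t x, y; };  // hypothetical 2D integer point, coordinates < 2^15

// Spread the low 16 bits of v so that they occupy the even bit positions.
inline uint32_t spread_bits(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// 32-bit Morton (Z-order) code of a 2D point with 16-bit coordinates.
inline uint32_t morton2d(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}

inline uint64_t dist2(const Point& a, const Point& b) {
    int64_t dx = int64_t(a.x) - int64_t(b.x);
    int64_t dy = int64_t(a.y) - int64_t(b.y);
    return uint64_t(dx * dx + dy * dy);
}

// For every point, return the indices of up to k approximate nearest neighbors.
// Assumption: every shift is < 2^15, so shifted coordinates still fit in 16 bits.
std::vector<std::vector<int>> approx_knn(const std::vector<Point>& pts, int k,
                                         const std::vector<uint32_t>& shifts) {
    assert(k > 0);
    const int n = int(pts.size());
    std::vector<std::vector<int>> result(n);
    std::vector<int> order(n);
    std::vector<uint32_t> code(n);
    for (uint32_t s : shifts) {
        // Sort point indices by the Morton code of the shifted coordinates.
        for (int i = 0; i < n; ++i) {
            code[i] = morton2d(pts[i].x + s, pts[i].y + s);
            order[i] = i;
        }
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return code[a] < code[b]; });
        for (int pos = 0; pos < n; ++pos) {
            const int q = order[pos];
            std::vector<int>& cand = result[q];
            // Candidates: k predecessors and k successors in sorted order.
            const int lo = std::max(0, pos - k), hi = std::min(n - 1, pos + k);
            for (int j = lo; j <= hi; ++j)
                if (j != pos) cand.push_back(order[j]);
            // Remove duplicates, then keep the k closest (ties broken by index).
            std::sort(cand.begin(), cand.end());
            cand.erase(std::unique(cand.begin(), cand.end()), cand.end());
            std::sort(cand.begin(), cand.end(), [&](int a, int b) {
                const uint64_t da = dist2(pts[q], pts[a]), db = dist2(pts[q], pts[b]);
                return da != db ? da < db : a < b;
            });
            if (int(cand.size()) > k) cand.resize(k);
        }
    }
    return result;
}
```

In this sketch the candidate window holds at most 2k points per shift, so for k up to 16 it fits within the 32 threads of a warp. That is one plausible reading of why the abstract singles out k = 2, 4, 8, and 16, but the sketch itself makes no claim about how the paper maps the work onto threads.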
