Large-scale 3D fast Fourier transform computation on a GPU

  • Jaehong Lee (Computer Science and Engineering, KOREAT)
  • Duksu Kim (Computer Science and Engineering, KOREAT)
  • Received: 2022.08.05
  • Reviewed: 2022.12.27
  • Published: 2023.12.10

Abstract

We propose a novel graphics processing unit (GPU) algorithm for large-scale 3D fast Fourier transform (3D-FFT) problems whose data size exceeds the GPU's memory. To work within the limited device memory, the 3D-FFT is computed with a 1D FFT-based approach. Moreover, to reduce the communication overhead between the CPU and GPU, we propose a 3D data-transposition method that converts the target 1D vector into a contiguous memory layout, which improves data transfer efficiency. The transposed data are transferred between host and device memory efficiently through a pinned buffer and multiple streams. We apply our method to various large-scale benchmarks and compare its performance with a state-of-the-art multicore CPU FFT library, the Fastest Fourier Transform in the West (FFTW), and with a prior GPU-based 3D-FFT algorithm. Our method achieves higher performance than FFTW (up to 2.89 times), and the performance gap widens as the data size increases. The performance of the prior GPU algorithm decreases considerably on massive-scale problems, whereas our method's performance remains stable.
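
The idea of computing a 3D transform from 1D transforms, with a transposition between passes so that the vectors being transformed are always contiguous, can be illustrated with a minimal CPU-only sketch. This is not the authors' implementation: it uses no GPU, pinned buffers, or streams, a naive O(n^2) DFT stands in for a batched 1D FFT kernel, and every name in it (`dft_1d`, `transform_last_axis`, `rotate_axes`, `fft3d_via_1d`, `batch_rows`) is hypothetical. The `batch_rows` loop only mimics processing the data in chunks small enough to fit in device memory.

```python
import cmath


def dft_1d(row):
    """Naive O(n^2) 1D DFT; a stand-in for a batched 1D FFT kernel."""
    n = len(row)
    return [
        sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def transform_last_axis(flat, shape, batch_rows=2):
    """Transform every contiguous row of a row-major (a, b, c) array.

    Rows are handled batch_rows at a time (an assumed chunk size) to mimic
    moving one chunk at a time to a device with limited memory.
    """
    a, b, c = shape
    n_rows = a * b
    out = list(flat)
    for start in range(0, n_rows, batch_rows):
        stop = min(start + batch_rows, n_rows)
        for r in range(start, stop):
            out[r * c:(r + 1) * c] = dft_1d(flat[r * c:(r + 1) * c])
    return out


def rotate_axes(flat, shape):
    """Transpose (a, b, c) -> (c, a, b) so the old middle axis is contiguous."""
    a, b, c = shape
    out = [0j] * (a * b * c)
    for i in range(a):
        for j in range(b):
            for k in range(c):
                out[(k * a + i) * b + j] = flat[(i * b + j) * c + k]
    return out, (c, a, b)


def fft3d_via_1d(flat, shape):
    """3D DFT as three passes of contiguous 1D transforms plus transposes.

    After three rotations the data is back in its original (a, b, c) layout.
    """
    for _ in range(3):
        flat = transform_last_axis(flat, shape)
        flat, shape = rotate_axes(flat, shape)
    return flat


def dft3d_direct(flat, shape):
    """Direct 3D DFT from the definition, used only as a reference."""
    a, b, c = shape
    out = []
    for u in range(a):
        for v in range(b):
            for w in range(c):
                acc = 0j
                for i in range(a):
                    for j in range(b):
                        for k in range(c):
                            phase = u * i / a + v * j / b + w * k / c
                            acc += flat[(i * b + j) * c + k] * cmath.exp(
                                -2j * cmath.pi * phase
                            )
                out.append(acc)
    return out


shape = (2, 3, 4)
data = [complex(n, -(n % 5)) for n in range(2 * 3 * 4)]
result = fft3d_via_1d(data, shape)
reference = dft3d_direct(data, shape)
max_error = max(abs(x - y) for x, y in zip(result, reference))
print("max abs difference vs direct 3D DFT:", max_error)
```

In a GPU setting, each batch of contiguous rows would presumably be copied through a pinned host buffer and processed on its own stream so that transfers overlap with computation; the sketch leaves that part out entirely.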

Funding Information

This research was supported by the Basic Science Research Program of the National Research Foundation of Korea (NRF), through the Ministry of Education, under Grant 2021R1I1A3048263 (High-Performance CGH Algorithms for Ultra-High Resolution Hologram Generation, 100%), and by the Education and Research Promotion Program of Korea University of Technology and Education (KOREATECH) in 2023.
