Large-scale 3D fast Fourier transform computation on a GPU

Jaehong Lee; Duksu Kim

doi:10.4218/etrij.2022-0297

ETRI Journal

Volume 45, Issue 6 / Pages 1035-1045 / 2023 / 1225-6463 (pISSN) / 2233-7326 (eISSN)

Electronics and Telecommunications Research Institute (ETRI)

Jaehong Lee (Computer Science and Engineering, KOREAT) ;
Duksu Kim (Computer Science and Engineering, KOREAT)

Received: 2022.08.05
Reviewed: 2022.12.27
Published: 2023.12.10

https://doi.org/10.4218/etrij.2022-0297

Abstract

We propose a novel graphics processing unit (GPU) algorithm for large-scale 3D fast Fourier transform (3D-FFT) problems whose data size exceeds the GPU's memory. To work within the limited device memory, the 3D-FFT is computed as a series of 1D FFTs. To reduce the communication overhead between the CPU and GPU, we propose a 3D data-transposition method that rearranges the target 1D vector into a contiguous memory layout, which improves data transfer efficiency. The transposed data are transferred between host and device memory through a pinned buffer and multiple streams. We apply our method to various large-scale benchmarks and compare its performance with a state-of-the-art multicore CPU FFT library, the Fastest Fourier Transform in the West (FFTW), and with a prior GPU-based 3D-FFT algorithm. Our method achieves up to 2.89 times higher performance than FFTW, and the performance gap widens as the data size increases. The performance of the prior GPU algorithm decreases considerably on massive-scale problems, whereas the performance of our method remains stable.
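
The decomposition described in the abstract can be illustrated with a small host-only sketch. It is not the authors' GPU implementation: it uses NumPy on the CPU, and the function name `fft3d_by_1d_passes` and the parameter `chunk_rows` are hypothetical names chosen for this example. The sketch assumes that a 3D-FFT can be computed as three passes of batched 1D FFTs, one per axis, that each pass first transposes the data so every target 1D vector is contiguous in memory, and that the vectors are then processed in fixed-size chunks, as they would be if only a limited number fit in device memory at once. Pinned buffers and multiple streams are not modeled.

```python
import numpy as np


def fft3d_by_1d_passes(volume, chunk_rows=4):
    """Compute a 3D FFT as three passes of batched 1D FFTs.

    chunk_rows is a hypothetical stand-in for the number of 1D vectors
    that would fit in device memory at once (assumption of this sketch).
    """
    # Working copy in complex precision; the caller's input is left untouched.
    data = np.array(volume, dtype=np.complex128)

    for axis in range(3):
        # Transpose so the target axis is last, then copy into a contiguous
        # layout: every 1D vector to be transformed is now one dense row.
        moved = np.ascontiguousarray(np.moveaxis(data, axis, -1))
        rows = moved.reshape(-1, moved.shape[-1])  # view onto `moved`

        # Transform a limited number of rows at a time, mimicking the
        # transfer-compute-transfer cycle of an out-of-core algorithm.
        for start in range(0, rows.shape[0], chunk_rows):
            stop = start + chunk_rows
            rows[start:stop] = np.fft.fft(rows[start:stop], axis=-1)

        # Undo the transposition so the axes are back in their original order.
        data = np.moveaxis(moved, -1, axis)

    return data
```

A quick sanity check is to compare the result with `np.fft.fftn` on a small random volume; the two should agree to within floating-point rounding regardless of the chunk size chosen.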

Funding Information

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, under Grant 2021R1I1A3048263 (High-Performance CGH Algorithms for Ultra-High Resolution Hologram Generation, 100%), and by the Education and Research Promotion Program of Korea University of Technology and Education (KOREATECH) in 2023.

