Acknowledgement
This research was supported by the Basic Science Research Program of the National Research Foundation of Korea (NRF), funded by the Ministry of Education, under Grant 2021R1I1A3048263 (High-Performance CGH Algorithms for Ultra-High Resolution Hologram Generation, 100%), and by the Education and Research Promotion Program of Korea University of Technology and Education (KOREATECH) in 2023.
References
- R. N. Bracewell, The Fourier transform and its applications, McGraw-Hill, New York, 1986.
- J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comput. 19 (1965), no. 90, 297-301. https://doi.org/10.1090/s0025-5718-1965-0178586-1
- M. Frigo and S. G. Johnson, The design and implementation of FFTW3, Proc. IEEE 93 (2005), no. 2, 216-231. https://doi.org/10.1109/JPROC.2004.840301
- M. Frigo and S. G. Johnson, FFTW: an adaptive software architecture for the FFT, (Proc. 1998 IEEE Int. Conf. Acoust. Speech Signal Process. Seattle, WA, USA), 1998, pp. 1381-1384.
- Intel, Intel® Math Kernel Library, 2020. https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html
- K. Matsushima, Introduction to computer holography: creating computer-generated holograms as the ultimate 3D image, Springer Nature, London, UK, 2020.
- D. Takahashi, Fast Fourier transform algorithms for parallel computers, Springer, Berlin, Heidelberg, Germany, 2019.
- MathWorks, MATLAB, 2022. https://mathworks.com/
- NVIDIA, cuFFT library. https://docs.nvidia.com/cuda/cufft/index.html
- S. Chen and X. Li, A hybrid GPU/CPU FFT library for large FFT problems, (IEEE 32nd Int. Perform. Comput. Commun. Conf. (IPCCC), San Diego, CA, USA), 2013, pp. 1-10.
- J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C programming, John Wiley & Sons, Hoboken, NJ, USA, 2014.
- A. Kumar, S. Gavel, and A. S. Raghuvanshi, FPGA implementation of radix-4-based two-dimensional FFT with and without pipelining using efficient data reordering scheme, Nanoelectronics, circuits and communication systems, Springer, Republic of Singapore, 2021, pp. 613-623.
- N. H. Nguyen, S. A. Khan, C.-H. Kim, and J.-M. Kim, A high-performance, resource-efficient, reconfigurable parallel-pipelined FFT processor for FPGA platforms, Microprocess. Microsyst. 60 (2018), 96-106. https://doi.org/10.1016/j.micpro.2018.04.003
- S. Khokhriakov, R. R. Manumachu, and A. Lastovetsky, Performance optimization of multithreaded 2D fast Fourier transform on multicore processors using load imbalancing parallel computing method, IEEE Access 6 (2018), 64202-64224. https://doi.org/10.1109/ACCESS.2018.2878271
- L. Gu, X. Li, and J. Siegel, An empirically tuned 2D and 3D FFT library on CUDA GPU, (Proc. 24th ACM Int. Conf. Supercomput., Association for Computing Machinery, New York, NY, USA), 2010, pp. 305-314.
- L. Gu, J. Siegel, and X. Li, Using GPUs to compute large out-of-card FFTs, (Proc. Int. Conf. Supercomput., Association for Computing Machinery, New York, NY, USA), 2011. https://doi.org/10.1145/1995896.1995937
- Z. Zhao and Y. Zhao, The optimization of FFT algorithm based with parallel computing on GPU, (Proc. IEEE 3rd Adv. Inf. Technol. Electron. Autom. Control Conf. (IAEAC), IEEE, Chongqing, China), 2018, pp. 2003-2007.
- Advanced Micro Devices, rocFFT. https://rocfft.readthedocs.io/en/rocm-5.3.1/, Accessed: 2022-11-04.
- Y. Hu, L. Lu, and C. Li, Memory-accelerated parallel method for multidimensional fast Fourier implementation on GPU, J. Supercomput. 78 (2022), no. 16, 18189-18208. https://doi.org/10.1007/s11227-022-04570-9
- S. Durrani, M. S. Chughtai, M. Hidayetoglu, R. Tahir, A. Dakkak, L. Rauchwerger, F. Zaffar, and W. M. Hwu, Accelerating Fourier and number theoretic transforms using tensor cores and warp shuffles, (30th Int. Conf. Parallel Archit. Compilation Tech. (PACT), Atlanta, GA, USA), 2021. https://doi.org/10.1109/PACT52795.2021.00032
- B. Li, S. Cheng, and J. Lin, tcFFT: a fast half-precision FFT library for NVIDIA Tensor Cores, (IEEE Int. Conf. Cluster Comput. (CLUSTER), Portland, OR, USA), 2021. https://doi.org/10.1109/Cluster48925.2021.00035
- L. Pisha and L. Ligowski, Accelerating non-power-of-2 size Fourier transforms with GPU tensor cores, (IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), Portland, OR, USA), 2021. https://doi.org/10.1109/IPDPS49936.2021.00059
- A. Sorna, X. Cheng, E. D'Azevedo, K. Won, and S. Tomov, Optimizing the fast Fourier transform using mixed precision on tensor core hardware, (IEEE 25th Int. Conf. High Perform. Comput. Workshops (HiPCW), Bengaluru, India), 2018, pp. 3-7.
- Y. Ogata, T. Endo, N. Maruyama, and S. Matsuoka, An efficient, model-based CPU-GPU heterogeneous FFT library, (IEEE Int. Symp. Parallel Distrib. Process., Miami, FL, USA), 2008. https://doi.org/10.1109/IPDPS.2008.4536163
- J. Lee, H. Kang, H. J. Yeom, S. Cheon, J. Park, and D. Kim, Out-of-core GPU 2D-shift-FFT algorithm for ultra-high-resolution hologram generation, Opt. Express 29 (2021), no. 12, 19094-19112. https://doi.org/10.1364/OE.422266
- A. Gholami, J. Hill, D. Malhotra, and G. Biros, AccFFT: a library for distributed-memory FFT on CPU and GPU architectures, arXiv preprint, 2015. https://doi.org/10.48550/arXiv.1506.07933
- D. Sharp, M. Stoyanov, S. Tomov, and J. Dongarra, A more portable HeFFTe: implementing a fallback algorithm for scalable Fourier transforms, (IEEE High Perform. Extreme Comput. Conf. (HPEC), Waltham, MA, USA), 2021, pp. 1-5.
- A. Ayala, S. Tomov, M. Stoyanov, A. Haidar, and J. Dongarra, Performance analysis of parallel FFT on large multi-GPU systems, (IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), Lyon, France), 2022. https://doi.org/10.1109/IPDPSW55747.2022.00072
- A. Ayala, S. Tomov, M. Stoyanov, and J. Dongarra, Scalability issues in FFT computation, Parallel computing technologies, Lecture Notes in Computer Science, Springer, Cham, Switzerland, 2021, pp. 279-287.
- S. Cayrols, J. Li, G. Bosilca, S. Tomov, A. Ayala, and J. Dongarra, Lossy all-to-all exchange for accelerating parallel 3-D FFTs on hybrid architectures with GPUs, (IEEE Int. Conf. Cluster Comput. (CLUSTER), Heidelberg, Germany), 2022, pp. 152-160.
- H. Kang, H. C. Kwon, and D. Kim, HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs, Computing 102 (2020), no. 12, 2607-2631. https://doi.org/10.1007/s00607-020-00846-1
- N. V. Sunitha, K. Raju, and N. N. Chiplunkar, Performance improvement of CUDA applications by reducing CPU-GPU data transfer overhead, (Int. Conf. Inventive Commun. Comput. Technol. (ICICCT), Coimbatore, India), 2017, pp. 211-215.
- H. Kang, J. Lee, and D. Kim, HI-FFT: heterogeneous parallel in-place algorithm for large-scale 2D-FFT, IEEE Access 9 (2021), 120261-120273. https://doi.org/10.1109/ACCESS.2021.3108404
- D. Kim, J. Lee, J. Lee, I. Shin, J. Kim, and S.-E. Yoon, Scheduling in heterogeneous computing environments for proximity queries, IEEE Trans. Visual. Comput. Graphics 19 (2013), no. 9, 1513-1525. https://doi.org/10.1109/TVCG.2013.71