Implementation of parallel blocked LU decomposition program for utilizing cache memory on GP-GPUs

  • Kim, Youngtae (Department of Computer Science, Gangneung-Wonju National University)
  • Kim, Doo-Han (Department of Computer Science, Gangneung-Wonju National University)
  • Yu, Myoung-Han (Department of Computer Science, Gangneung-Wonju National University)
  • Received : 2013.08.01
  • Accepted : 2013.10.14
  • Published : 2013.12.31

Abstract

GP-GPUs (general-purpose GPUs) apply the many threads of a GPU, hardware originally designed for graphics processing, to general numerical computation. Unlike a typical CPU cache, a GP-GPU exposes its cache memory as shared memory, which is shared among multiple threads and can be accessed and controlled directly by user programs. In this research, we implemented a parallel blocked LU decomposition program that uses this cache memory to improve computational performance. Implemented in Nvidia CUDA C, the parallel blocked LU decomposition program ran 7 to 8 times faster than a non-blocked LU decomposition program on the same GP-GPU.
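The blocking idea in the abstract can be illustrated with a small host-side sketch. This is not the authors' CUDA C program, which is not reproduced here. It is a plain C version of a right-looking blocked LU factorization, under several assumptions: no pivoting (so every pivot must be nonzero, as with a diagonally dominant matrix), a row-major square matrix whose order is a multiple of the tile width, and hypothetical names (`TILE`, `lu_unblocked`, `lu_blocked`, `max_abs_diff`). The local arrays `l` and `u` in the trailing update stand in for tiles that a GPU kernel would stage in shared memory, so that each value loaded once is reused `TILE` times.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define TILE 2 /* hypothetical tile width; stands in for a shared-memory block */

/* Unblocked in-place LU without pivoting: unit-lower L below the diagonal,
   U on and above it. Matrix a is n x n, row-major. */
void lu_unblocked(double *a, size_t n) {
    for (size_t k = 0; k < n; ++k)
        for (size_t i = k + 1; i < n; ++i) {
            a[i * n + k] /= a[k * n + k];
            for (size_t j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
}

/* Blocked right-looking LU without pivoting; same storage as above. */
void lu_blocked(double *a, size_t n) {
    assert(n % TILE == 0); /* assumption: order is a multiple of TILE */
    for (size_t kb = 0; kb < n; kb += TILE) {
        size_t ke = kb + TILE;
        /* 1. Factor the diagonal tile in place: A11 = L11 * U11. */
        for (size_t k = kb; k < ke; ++k)
            for (size_t i = k + 1; i < ke; ++i) {
                a[i * n + k] /= a[k * n + k];
                for (size_t j = k + 1; j < ke; ++j)
                    a[i * n + j] -= a[i * n + k] * a[k * n + j];
            }
        /* 2. Row panel: U12 = inverse(L11) * A12 by forward substitution. */
        for (size_t k = kb; k < ke; ++k)
            for (size_t i = k + 1; i < ke; ++i)
                for (size_t j = ke; j < n; ++j)
                    a[i * n + j] -= a[i * n + k] * a[k * n + j];
        /* 3. Column panel: L21 = A21 * inverse(U11). */
        for (size_t i = ke; i < n; ++i)
            for (size_t k = kb; k < ke; ++k) {
                a[i * n + k] /= a[k * n + k];
                for (size_t j = k + 1; j < ke; ++j)
                    a[i * n + j] -= a[i * n + k] * a[k * n + j];
            }
        /* 4. Trailing update, one tile at a time: A22 -= L21 * U12. */
        for (size_t ib = ke; ib < n; ib += TILE)
            for (size_t jb = ke; jb < n; jb += TILE) {
                double l[TILE][TILE], u[TILE][TILE]; /* staged tiles */
                for (size_t i = 0; i < TILE; ++i)
                    for (size_t j = 0; j < TILE; ++j) {
                        l[i][j] = a[(ib + i) * n + kb + j];
                        u[i][j] = a[(kb + i) * n + jb + j];
                    }
                for (size_t i = 0; i < TILE; ++i)
                    for (size_t j = 0; j < TILE; ++j) {
                        double s = 0.0;
                        for (size_t k = 0; k < TILE; ++k)
                            s += l[i][k] * u[k][j];
                        a[(ib + i) * n + jb + j] -= s;
                    }
            }
    }
}

/* Largest absolute elementwise difference between two arrays. */
double max_abs_diff(const double *x, const double *y, size_t len) {
    double m = 0.0;
    for (size_t i = 0; i < len; ++i) {
        double d = fabs(x[i] - y[i]);
        if (d > m) m = d;
    }
    return m;
}
```

Calling `lu_unblocked` and `lu_blocked` on copies of the same matrix should give factors that agree up to rounding, since blocking only reorders the arithmetic. In a GPU version, step 4 is typically where most of the work lies, which is why staging its tiles in fast on-chip memory is the natural place to look for a speedup.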
