Performance of the Finite Difference Method Using Cache and Shared Memory for Massively Parallel Systems

  • Kim, Hyun Kyu (김현규) (Div. of Computer Science and Engineering, Chonbuk National University) ;
  • Lee, Hyo Jong (이효종) (Div. of Computer Science and Engineering, CAIIT, Chonbuk National University)
  • Received : 2013.01.31
  • Published : 2013.04.25

Abstract

Many methods have been developed to improve performance on massively parallel systems, such as GPUs, that consist of several hundred processors. On GPUs, shared memory has typically been used in a role similar to caching. Algorithms such as image filters compute each output value from neighboring values, so neighbor references occur frequently and shared memory improves their performance. However, using shared memory requires reimplementing existing code, which increases code complexity. Recent GPU systems support L1 and L2 caches in addition to shared memory. Because the L1 cache resides in the same hardware as the shared memory, using the cache is expected to improve performance as well. This paper therefore compares the performance of cache memory and shared memory. The results show that the cache-based algorithm performs similarly to the shared-memory algorithm, while also avoiding the increase in code complexity that shared-memory programming introduces.
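
The code-complexity contrast described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it is a CPU-side Python emulation, the function names `laplacian_direct` and `laplacian_tiled` are hypothetical, and the 5-point Laplacian and the tile size are assumptions chosen only for illustration. The first function reads neighbors straight from the input array, as a kernel that relies on the hardware cache would. The second stages each tile plus a one-pixel halo into a local buffer first, as a shared-memory kernel would (on a GPU that staging step would be followed by a barrier such as `__syncthreads()` in CUDA). The sketch says nothing about speed; it only shows the extra index bookkeeping the tiled form needs.

```python
def laplacian_direct(img, w, h):
    """Cache-style form: every output reads its 4 neighbours directly from img."""
    out = list(img)  # border pixels are left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            i = y * w + x
            out[i] = img[i - 1] + img[i + 1] + img[i - w] + img[i + w] - 4 * img[i]
    return out


def laplacian_tiled(img, w, h, tile=4):
    """Shared-memory-style form: stage a tile plus halo, then compute from it."""
    out = list(img)  # border pixels are left unchanged
    tw = tile + 2    # tile width including a 1-pixel halo on each side
    for ty in range(1, h - 1, tile):
        for tx in range(1, w - 1, tile):
            # Stage: copy the tile and its halo into a local buffer
            # (stands in for loading GPU shared memory).
            buf = [0] * (tw * tw)
            for ly in range(tw):
                for lx in range(tw):
                    gy, gx = ty - 1 + ly, tx - 1 + lx
                    if gy < h and gx < w:
                        buf[ly * tw + lx] = img[gy * w + gx]
            # Compute: interior cells of the buffer map back to global pixels.
            for ly in range(1, tile + 1):
                for lx in range(1, tile + 1):
                    gy, gx = ty - 1 + ly, tx - 1 + lx
                    if gy < h - 1 and gx < w - 1:
                        j = ly * tw + lx
                        out[gy * w + gx] = (buf[j - 1] + buf[j + 1]
                                            + buf[j - tw] + buf[j + tw]
                                            - 4 * buf[j])
    return out


if __name__ == "__main__":
    w, h = 10, 9
    img = [(x * x + 3 * y) % 7 for y in range(h) for x in range(w)]
    # Both forms must produce identical results; only the code shape differs.
    print(laplacian_direct(img, w, h) == laplacian_tiled(img, w, h))
```

Both functions compute the same values. The tiled version adds a staging loop, a second coordinate system (local versus global indices), and extra bounds checks, which is the kind of added branching the abstract refers to as increased code complexity.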
