Analysis of the Relationship between GPU Performance and Memory Access Time according to Changes in the Task Processing Unit

Analysis of GPU Performance and Memory Efficiency according to Task Processing Units

  • 손동오 (School of Electronics and Computer Engineering, Chonnam National University) ;
  • 심규연 (School of Electronics and Computer Engineering, Chonnam National University) ;
  • 김철홍 (School of Electronics and Computer Engineering, Chonnam National University)
  • Received : 2015.11.27
  • Accepted : 2015.12.30
  • Published : 2015.12.31

Abstract

Modern GPUs achieve a high degree of parallel processing by exploiting the many cores contained in the processor. The GPGPU architecture, one technique for exploiting the GPU's high parallelism, allows most CPU workloads to be processed on the GPU and makes effective use of the GPU's high parallelism and hardware resources. In this paper, we use a variety of benchmark programs to analyze memory efficiency and performance as the number of allocated CTAs (Cooperative Thread Arrays) changes. Experimental results show that, as the number of allocated CTAs increases, performance improves in most benchmark programs, while some benchmark programs show no performance gain from additional CTAs. We attribute this to the small number of CTAs generated by those benchmark programs, or to a fixed limit on the number of CTAs that can execute simultaneously. In addition, for each benchmark program we analyze memory stalls caused by memory channel congestion, memory stalls caused by interconnection network congestion, and stalls occurring at the memory stage of the pipeline, and identify their relationship with performance. We expect the results of this analysis to serve as useful information for research on improving the parallelism and memory efficiency of the GPGPU architecture.
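As background for how a workload's CTA count is determined, the following minimal CUDA sketch (not taken from the paper; the scale kernel, problem size, and block size are illustrative assumptions) shows that the number of CTAs generated by a kernel is simply the grid dimension chosen by the host code, so a benchmark with a small problem size necessarily produces few CTAs.

    // Minimal illustrative sketch: the grid dimension chosen at launch time
    // determines how many CTAs (thread blocks) the kernel generates.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;          // assumed problem size: 1M elements
        const int threadsPerCTA = 256;  // threads per CTA (thread block)
        const int numCTAs = (n + threadsPerCTA - 1) / threadsPerCTA;  // grid size = CTA count

        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        scale<<<numCTAs, threadsPerCTA>>>(d_data, 2.0f, n);  // launches numCTAs CTAs
        cudaDeviceSynchronize();
        printf("kernel launched with %d CTAs of %d threads\n", numCTAs, threadsPerCTA);
        cudaFree(d_data);
        return 0;
    }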

Modern GPUs can execute massively parallel computations by exploiting many GPU cores. The GPGPU architecture, one approach that exploits the GPU's outstanding computational resources, executes general-purpose applications as well as graphics applications effectively. In this paper, we investigate the impact of the number of CTAs (Cooperative Thread Arrays) per SM (Streaming Multiprocessor) on memory efficiency and performance, since analyzing this relationship provides insight for researchers studying GPUs to improve performance. Our simulation results show that most benchmarks improve in performance as the number of CTAs per SM increases. On the other hand, some benchmarks show no performance improvement, because the number of CTAs generated from the same kernel is small or the number of CTAs that can execute simultaneously is limited. To characterize the performance behavior for each CTA count per SM more precisely, we also analyze the relationship between performance and memory stalls, DRAM stalls due to interconnect congestion, and pipeline stalls at the memory stage. We expect our analysis results to help research on improving parallelism and memory efficiency in the GPGPU architecture.
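The second limitation mentioned above, the number of CTAs that can execute simultaneously on one SM, can be queried through the CUDA occupancy API, as in the sketch below. This is only an illustrative assumption, not the paper's simulation methodology (the paper uses a cycle-level simulator); the scale kernel and the block size of 256 are placeholders.

    // Minimal illustrative sketch: how many CTAs of a given kernel can be
    // resident on a single SM at once. Register, shared-memory, and thread
    // limits cap this number, so adding CTAs to the grid beyond
    // (CTAs per SM) x (number of SMs) does not add more concurrency.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256;  // assumed threads per CTA
        int ctasPerSM = 0;
        // CUDA runtime occupancy query (available since CUDA 6.5)
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ctasPerSM, scale, blockSize, 0);

        printf("%s: up to %d concurrent CTAs per SM, %d SMs\n",
               prop.name, ctasPerSM, prop.multiProcessorCount);
        return 0;
    }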

Keywords

References

1. J. Y. Chang, W. J. Kim, K. J. Byun, and N. W. Eum, "Performance Analysis for Multimedia Video Codec on On-Chip Network," KISM Smart Media Journal, Vol. 1, No. 1, pp. 27-35, 2012.
  2. S. B. Heo, J. H. Park, and H. S. Jo, "An performance analysis on SSD caching mechanism in Linux," KISM Smart Media Journal, Vol. 4, No. 2, pp. 62-67, 2015.
  3. V. Agarwal, M.S. Hrishikesh, S. W. Keckler, and D. Burger, "Clock rate versus IPC: the end of the road for conventional microarchitectures," In Proceedings of the 27th International Symposium on Computer Architecture, pp. 248-259, 2000.
4. K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," In Proceedings of 7th Conference on Architectural Support for Programming Languages and Operating Systems, pp. 2-11, 1996.
5. V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, "Clock rate versus IPC: the end of the road for conventional microarchitectures," In Proceedings of the 27th International Symposium on Computer Architecture, pp. 248-259, 2000.
  6. M. D. Hill and M. R. Marty, "Amdahl's law in the multicore era," IEEE Computer, Vol. 41, No. 7, pp. 33-38, 2008.
7. S. Y. Lee, A. Arunkumar, and C. J. Wu, "CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads," In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 515-527, 2015.
  8. D. Voitsechov, and Y. Etsion, "Single-graph multiple flows: Energy efficient design alternative for GPGPUs," In Proceedings of the 41st Annual International Symposium on Computer Architecture, pp. 205-216, 2014.
  9. M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. G. Cho, and S. Ryu, "Improving GPGPU resource utilization through alternative thread block scheduling," In Proceedings of 20th IEEE International Symposium on High Performance Computer Architecture, pp. 260-271, 2014.
  10. H. J. Lee, K. J. Brown, A. K. Sujeeth, T. Rompf, and K. Olukotun, "Locality-Aware Mapping of Nested Parallel Patterns on GPUs," In Proceedings of 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 63-74, 2014.
  11. H. J. Choi, and C. H. Kim, "Analysis on Memory Characteristics of Graphics Processing Units for Designing Memory System of General-Purpose Computing on Graphics Processing Units," KISM Smart Media Journal, Vol. 3, No. 1, pp. 33-38, 2014.
  12. I. A. Buck, "Programming CUDA," In Supercomputing 2007 Tutorial Notes, 2007.
  13. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," In Proceedings of 9th International Symposium on Performance Analysis of Systems and Software, pp. 163-174, 2009.
14. S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," In Proceedings of the International Symposium on Microarchitecture, pp. 469-480, 2009.
  15. NVIDIA's Next Generation CUDA Compute Architecture: Fermi, available at www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
16. CUDA SDK, available at http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html
17. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," In Proceedings of the International Symposium on Workload Characterization (IISWC), pp. 44-54, 2009.