Design of a High-Performance Mobile GPGPU with SIMT Architecture based on a Small-size Warp Scheduler

  • Received : 2021.09.17
  • Accepted : 2021.09.23
  • Published : 2021.09.30

Abstract

This paper proposes and designs an architecture that achieves high performance with a small number of cores in a GPGPU with an SIMT structure. A GPGPU intended for mobile devices requires an architecture that maximizes performance relative to power consumption. To reduce power consumption, the number of cores was reduced; to recover performance, the warp size managed by the warp scheduler was set to 4, greatly reduced from the 32 typical of general GPGPUs. A smaller warp size reduces the number of idle cycles in the pipeline and allows memory latency to be hidden more efficiently, reducing the miss penalty when accessing cache memory. The computational performance of the designed GPGPU was measured with a test program that includes floating-point operations, and its power consumption was measured on a 28 nm CMOS process, yielding a power efficiency of 104.5 GFlops/Watt. This result is about four times better than the power efficiency of Nvidia's Tegra K1.

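As an illustrative aside, the abstract's central claim can be sketched with a toy round-robin issue model (this is a hypothetical simulation for intuition, not the paper's design or measurements): splitting a fixed thread budget into smaller warps yields more warps in flight, so the scheduler more often finds a ready warp while others wait on memory, wasting fewer pipeline cycles.

```python
# Toy round-robin warp-issue model (illustrative only, not the paper's design).
# Each warp alternates between issuing one instruction and stalling for
# `mem_latency` cycles on a memory access.  A fixed thread budget split into
# smaller warps yields more warps, so the scheduler more often finds a ready
# warp and the pipeline sits idle less often.
def utilization(total_threads, warp_size, mem_latency, insts_per_warp=8):
    n_warps = total_threads // warp_size
    ready_at = [0] * n_warps                 # cycle when each warp is ready again
    remaining = [insts_per_warp] * n_warps   # instructions left per warp
    cycle = issued = 0
    while any(remaining):
        # Issue from the first ready warp that still has work, if any.
        for w in range(n_warps):
            if remaining[w] and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + 1 + mem_latency
                issued += 1
                break
        cycle += 1
    return issued / cycle

# Smaller warps (more warps in flight) hide a 20-cycle memory latency better:
print(utilization(128, 4, 20))   # warp size 4  -> 32 warps in flight
print(utilization(128, 32, 20))  # warp size 32 ->  4 warps in flight
```

Under these assumed parameters (128 threads, 20-cycle latency), the warp-size-4 configuration keeps the pipeline busy most of the time, while the warp-size-32 configuration leaves it idle for most cycles, mirroring the trade-off the paper exploits.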
Acknowledgement

This work was supported by Seokyeong University in 2020.
