
Using the On-Package Memory of Manycore Processor for Improving Performance of MPI Intra-Node Communication


  • 조중연 (Dept. of Computer Science and Engineering, Konkuk University) ;
  • 진현욱 (Dept. of Computer Science and Engineering, Konkuk University) ;
  • 남덕윤 (Supercomputing Center, Korea Institute of Science and Technology Information)
  • Received : 2016.08.18
  • Accepted : 2016.11.23
  • Published : 2017.02.15

Abstract

The emerging next-generation manycore processors for high-performance computing are equipped with a high-bandwidth on-package memory along with the traditional host memory. The Multi-Channel DRAM (MCDRAM), for example, is the on-package memory of the Intel Xeon Phi Knights Landing (KNL) processor and theoretically provides four times higher bandwidth than conventional DDR4 memory. In this paper, we propose a mechanism that exploits MCDRAM to improve the performance of MPI intra-node communication. The experimental results show that MPI intra-node communication performance can be improved by up to 272% compared with the case where DDR4 is used. Moreover, we analyze the performance impact not only of different MCDRAM-utilization mechanisms but also of process core affinity.
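The abstract refers to different MCDRAM-utilization mechanisms without detailing them here. As a purely illustrative sketch (an assumption, not the authors' implementation), the following C snippet shows one common way to place an intra-node communication buffer in KNL's MCDRAM: allocating it through the memkind library's high-bandwidth kind, with a fallback to the default DDR4-backed kind when MCDRAM is unavailable (for example, when the node is booted in cache mode).

    /*
     * Hypothetical sketch: allocate an MPI intra-node staging buffer from
     * MCDRAM via the memkind library, falling back to DDR4 if needed.
     * Build with: cc -o mcdram_buf mcdram_buf.c -lmemkind
     */
    #include <stdio.h>
    #include <string.h>
    #include <memkind.h>

    #define STAGING_BUF_SIZE (4UL * 1024 * 1024)  /* illustrative 4 MiB buffer */

    int main(void)
    {
        memkind_t kind = MEMKIND_HBW;  /* MCDRAM-backed allocations */
        void *buf;

        /* Fall back to the default kind (DDR4) if no high-bandwidth
         * memory is exposed to the process. */
        if (memkind_check_available(MEMKIND_HBW) != 0)
            kind = MEMKIND_DEFAULT;

        buf = memkind_malloc(kind, STAGING_BUF_SIZE);
        if (buf == NULL) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        /* An MPI library could place its intra-node copy buffers here so
         * that sender/receiver copies go through the on-package memory. */
        memset(buf, 0, STAGING_BUF_SIZE);

        memkind_free(kind, buf);
        return 0;
    }

When the KNL is booted in flat mode, MCDRAM also appears as a separate NUMA node, so an alternative (again, not necessarily the paper's mechanism) is to bind a whole process's memory to that node, e.g. numactl --membind=<mcdram-node> ./mpi_app, and to combine this with the MPI launcher's core-binding options when studying core-affinity effects.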



Acknowledgement

Grant : Research on job and data processing technologies for manycore-based supercomputers; Basic research on a manycore-based ultra-high-performance scalable OS

Supported by : Korea Institute of Science and Technology Information (KISTI), Institute for Information & communications Technology Promotion (IITP)
