References
- Goto, K., van de Geijn, R.A. "Anatomy of high-performance matrix multiplication", ACM Transactions on Mathematical Software (TOMS) 34(3), 12 (2008) https://doi.org/10.1145/1356052.1356053
- Gunnels, J.A., Henry, G.M., van de Geijn, R.A. "A family of high-performance matrix multiplication algorithms", In: International Conference on Computational Science, pp. 51-60. Springer (2001)
- Heinecke, A., Vaidyanathan, K., Smelyanskiy, M., Kobotov, A., Dubtsov, R., Henry, G., Shet, A.G., Chrysos, G., Dubey, P. "Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel Xeon Phi Coprocessor" In: Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pp. 126-137. IEEE (2013)
- "Intel Intrinsics Guide." Software.intel.com. (2018). [online] Available at: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ [Accessed 22 Mar. 2018].
- Jeffers, J., Reinders, J., Sodani, A. "Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition", Morgan Kaufmann (2016)
- Lim, R., Lee, Y., Kim, R., Choi, J. "An Implementation of matrix-matrix multiplication on the Intel KNL processor with AVX-512." In: Cluster Computing (Submitted)
- Peyton, J.L. "Programming dense linear algebra kernels on vectorized architectures." Master's thesis, The University of Tennessee, Knoxville (2013)
- Van Zee, F.G., van de Geijn, R.A. "BLIS: A Framework for Rapidly Instantiating BLAS Functionality" In: ACM Trans. Math. Softw., 41(3), pp. 1-33. ACM (2015)
- Xianyi, Z., Qian, W., Yunquan, Z. "Model-driven level 3 BLAS performance optimization on Loongson 3A processor" In: Parallel and Distributed Systems, 2012 IEEE 18th International Conference, pp. 684-691. IEEE (2012)