References
- J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU Computing," in Proceedings of the IEEE, vol. 96, 2008, pp. 879-899. https://doi.org/10.1109/JPROC.2008.917757
- J. Vetter, "Toward exascale computational science with heterogeneous processing," in GPGPU '10: Proceedings of the 3rd Workshop on General- Purpose Computation on Graphics Processing Units. New York, NY, USA: ACM, 2010, pp. 1-1.
- B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data parallel architectures," IEEE Transactions on Parallel and Distributed Systems, 2010.
- M. Silberstein, A. Schuster, D. Geiger, A. Patney, and J. D. Owens, "Efficient computation of sumproducts on GPUs through softwaremanaged cache," in ICS '08: Proceedings of the 22nd annual international conference on Supercomputing. New York, NY, USA: ACM, 2008, pp. 309-318.
- K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the efficiency of GPU algorithms for matrix-matrix multiplication," in HWWS '04: Proceedings of the ACM SIGGRAPH/ EUROGRAPHICS conference on Graphics hardware. New York, NY, USA: ACM, 2004, pp. 133-137.
- C. Jiang and M. Snir, "Automatic tuning matrix multiplication performance on graphics hardware," in Parallel Architectures and Compilation Techniques, 2005. PACT 2005. 14th International Conference on, Sept. 2005, pp. 185-194.
- M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, "Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories," in PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming. New York, NY, USA: ACM, 2008, pp. 1-10.
- M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, et. al., "A compiler framework for optimization of affine loop nests for GPGPUs," in ICS '08: Proceedings of the 22nd annual international conference on Supercomputing. New York, NY, USA: ACM, 2008, pp. 225-234.
- NVIDIA, "NVIDIA CUDA C Programming Guide 4.2." [Online]. Available: {http://www.nvidia.com/cuda/}
- Khronos Group, "OpenCL 1.0 Specification," Dec. 2008. [Online]. Available: {http://www.khronos.org/opencl/}
- NVIDIA, "OpenCL Programming Guide for the CUDA Architecture," May 2010. [Online]. Available:{http://developer.nvidia.com/object/cuda_3_1_ downloads .html}
- AMD, "OpenCL Programming Guide," Jun 2013. [Online]. Available: {http://developer.amd.com/}
- NVIDIA, "GPU Computing SDK Code Samples 4.2." [Online]. Available: {www.nvidia.com/object/cuda develop.html}
- S. Ghosh, M. Martonosi, and S. Malik, "Cache miss equations: an analytical representation of cache misses," in ICS '97: Proceedings of the 11th international conference on Supercomputing. New York, NY, USA: ACM, 1997, pp. 317-324.
- M. E. Wolf and M. S. Lam, "A data locality optimizing algorithm," in PLDI '91: Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation. New York, NY, USA: ACM, 1991, pp. 30-44.
- M. E. Wolf, M. S. Lam, "A loop transformation theory and an algorithm to maximize parallelism," IEEE Trans. Parallel Distrib. Syst., vol. 2, no. 4, pp. 452-471, 1991. https://doi.org/10.1109/71.97902
- AMD, "Stream Computing Forum." [Online]. Available: {http: //forums.amd.com/devforum/}
- NVIDIA, "CUDA Forum." [Online]. Available: {http://forums.nvidia.com/}
- B. Jang, D. Kaeli, S. Do, and H. Pien, "Multi GPU implementation of iterative tomographic reconstruction algorithms," in ISBI'09: Proceedings of the Sixth IEEE international conference on Symposium on Biomedical Imaging. Piscataway, NJ, USA: IEEE Press, 2009, pp. 185-188.
- M. Christiaens, B. De Sutter, K. De Bosschere, J. Van Campenhout, and I. Lemahieu, "A fast, cacheaware algorithm for the calculation of radiological paths exploiting subword parallelism," Journal of Systems Architecture, vol. 45, no. 10, pp. 781-790, 4 1999. https://doi.org/10.1016/S1383-7621(98)00038-1
- H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," in Computer Vision ECCV 2006, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2006, vol. 3951, pp. 404-417.