SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Recursive Basic Linear Algebra Operations on TensorCore GPU


Workshop: 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems

Authors: Shaoshuai Zhang, Vivek Karihaloo, and Panruo Wu (University of Houston)


Abstract: Motivated by the demand for high-speed matrix computations and deep neural network training, NVIDIA introduced TensorCore in its GPUs to further accelerate matrix-matrix multiplication. TensorCore supports very fast half-precision general matrix-matrix multiplications (GEMMs), around 8x faster than single-precision GEMMs on CUDA cores. So far, the use of TensorCore GPUs for matrix operations other than matrix-matrix multiplication remains underdeveloped. In this paper, we propose efficient BLAS3 operations that exploit TensorCore. Experimental results show that the proposed algorithms outperform the corresponding cuBLAS routines and the naive TensorCore implementation, with up to 4.7x speedup.




