Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs
Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels can launch up to thousands of similar matrix computations simultaneously, removing the expensive overhead of …
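
To make the batched calling convention concrete, the following is a minimal CPU-only sketch in plain Python, not the GPU kernels discussed in this paper. It solves a batch of very small lower-triangular systems through one entry point. The function names `solve_lower_triangular` and `batched_solve_lower_triangular` are hypothetical and chosen only for illustration. On a GPU, the loop over the batch would presumably be mapped to parallel threads or thread blocks within one kernel launch rather than executed sequentially.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve_lower_triangular(L: Matrix, b: Vector) -> Vector:
    """Forward substitution for one small lower-triangular system L x = b."""
    n = len(L)
    x = [0.0] * n
    for i in range(n):
        # Subtract the contribution of the already-computed unknowns.
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x


def batched_solve_lower_triangular(Ls: List[Matrix], bs: List[Vector]) -> List[Vector]:
    """Hypothetical batched entry point: one call covers the whole batch.

    This sequential loop only illustrates the interface; it does not
    model how a GPU would schedule the independent solves.
    """
    if len(Ls) != len(bs):
        raise ValueError("batch sizes of matrices and right-hand sides differ")
    return [solve_lower_triangular(L, b) for L, b in zip(Ls, bs)]


# A batch of 1000 independent 2x2 systems, each constructed so that
# its exact solution is x = [1, 3].
batch_size = 1000
Ls = [[[2.0, 0.0], [1.0, float(k + 1)]] for k in range(batch_size)]
bs = [[2.0, 1.0 + 3.0 * (k + 1)] for k in range(batch_size)]

xs = batched_solve_lower_triangular(Ls, bs)
```

The caller issues a single call for all 1000 systems instead of 1000 separate calls, which is the interface property the batched kernels rely on.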