Balanced Sparsity for Efficient DNN Inference on GPU
In trained deep neural networks, unstructured pruning can reduce redundant weights to lower storage cost. However, it requires customized hardware to speed up practical inference. Another trend accelerates sparse model inference on general-purpose hardware by adopting coarse-grained sparsity to prune or regularize consecutive weights for efficient computation. But …