Rotated Runtime Smooth: Training-Free Activation Smoother for Accurate
INT4 Inference
Large language models have demonstrated promising capabilities as their parameter counts scale up. However, serving them incurs substantial computation and memory-movement costs due to their large scale. Quantization methods have been employed to reduce serving costs and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. …
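To illustrate why activation outliers are so damaging, the following is a minimal sketch (not the paper's method) of symmetric per-tensor INT4 quantization: a single outlier channel inflates the quantization scale, so the remaining values are rounded onto only a few of the 16 available levels and the overall error explodes. The function name and synthetic data are illustrative assumptions.

```python
import numpy as np

def quantize_int4(x):
    # Symmetric per-tensor INT4: 16 levels in [-8, 7],
    # scale chosen from the largest-magnitude element (illustrative choice).
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale  # dequantized values

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=1024)  # well-behaved activations

err_clean = np.mean((acts - quantize_int4(acts)) ** 2)

acts_outlier = acts.copy()
acts_outlier[0] = 100.0  # one outlier channel inflates the scale
err_outlier = np.mean((acts_outlier - quantize_int4(acts_outlier)) ** 2)

# The outlier raises the quantization MSE by orders of magnitude.
print(err_clean, err_outlier)
```

Smoothing-style approaches mitigate exactly this failure mode by migrating the outlier magnitude out of the activations before quantizing.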