Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference

Large language models show promising capabilities as their parameter counts scale up. However, serving them incurs substantial computation and memory-movement costs because of their size. Quantization methods have been employed to reduce serving cost and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. …