Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs

Modern large language models (LLMs) have achieved state-of-the-art performance through architectural improvements, but they still incur significant computational cost at inference time. To reduce this cost, post-training quantization (PTQ) has become a popular approach: weights and activations are quantized to lower precision, such as INT8. In this paper, we …
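
To make the setting concrete, below is a minimal sketch of symmetric per-tensor INT8 quantization, the kind of rounding step PTQ applies to weights and activations. It is illustrative only (the function names and the toy values are assumptions, not the paper's method), and it also hints at why an activation spike is harmful: a single large outlier inflates the scale and collapses the resolution left for the remaining small values.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the INT8 codes."""
    return q.astype(np.float32) * scale

# Toy example: one activation spike (8.0) dominates the scale,
# so the much smaller entries round toward zero after quantization.
acts = np.array([0.02, -0.03, 0.01, 8.0], dtype=np.float32)
q, s = quantize_int8(acts)
print(dequantize(q, s))
```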