KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
LLMs are seeing growing use for applications such as document analysis and summarization, which require large context windows, and with these large context windows, KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions …