KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

LLMs are increasingly used for applications such as document analysis and summarization that require large context windows, and at these context lengths KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions …
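To make the memory pressure concrete, here is a minimal back-of-the-envelope sketch of KV cache size as a function of context length. The model configuration (32 layers, 32 heads, head dimension 128, fp16 storage) is an illustrative assumption for a LLaMA-7B-scale model, not a figure taken from the paper.

```python
# Rough KV cache footprint: one key and one value vector per layer, head,
# and token, stored at bytes_per_elem precision. Assumed config, for scale only.

def kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                   context_len=128_000, bytes_per_elem=2):
    # Factor of 2 covers both keys and values.
    return 2 * n_layers * n_heads * head_dim * context_len * bytes_per_elem

print(f"{kv_cache_bytes() / 2**30:.1f} GiB")  # ~62.5 GiB at 128k tokens in fp16
```

At roughly 0.5 MiB of fp16 keys and values per token under these assumptions, the cache grows linearly with context length and quickly dwarfs the model weights, which is why compressing it (for example by quantizing to lower precision) is attractive for long-context inference.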