TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs
Large language model (LLM) inference demands a significant amount of computation and memory, especially in the key attention mechanism. While techniques such as quantization and acceleration algorithms like FlashAttention have improved the efficiency of overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves …
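
For context, the baseline the abstract refers to is standard scaled dot-product attention; the sketch below (illustrative only, not the paper's TurboAttention method) shows the full-precision computation that quantization and FlashAttention-style kernels aim to make cheaper.

```python
# Illustrative sketch of standard scaled dot-product attention for one head.
# Names and shapes here are assumptions for the example, not from the paper.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, head_dim) arrays for a single attention head.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ V                             # weighted sum of values

# Example usage with random data
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 64))
K = rng.standard_normal((8, 64))
V = rng.standard_normal((8, 64))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (8, 64)
```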