TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs
Large language model (LLM) inference demands a significant amount of computation and memory, especially in the attention mechanism. While techniques such as quantization and acceleration algorithms like FlashAttention have improved the efficiency of overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves …