Lean Attention: Hardware-Aware Scalable Attention Mechanism for the
Decode-Phase of Transformers
Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of state-of-the-art models has increased steadily, reaching billions of parameters. These huge models are memory-hungry and incur significant inference latency even on cutting-edge …