DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and
Effective for LMMs
Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input …
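To make the baseline scheme concrete, below is a minimal PyTorch-style sketch of the common LMM input layout described above: visual tokens are prepended to the text token embeddings and the whole sequence is processed by the LLM from its first layer onward. All module and tensor names here (vision_encoder, projector, llm, text_embedding) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class NaiveLMM(nn.Module):
    """Sketch of the standard 'visual tokens as input-layer sequence' design."""
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module,
                 llm: nn.Module, text_embedding: nn.Embedding):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a ViT producing patch features (assumed)
        self.projector = projector             # maps visual features to the LLM hidden size (assumed)
        self.llm = llm                         # decoder-only transformer accepting input embeddings (assumed)
        self.text_embedding = text_embedding   # LLM input embedding table (assumed)

    def forward(self, images: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_encoder(images))   # (B, N_v, D)
        text_tokens = self.text_embedding(text_ids)                   # (B, N_t, D)
        # All N_v visual tokens enter at the first LLM layer, lengthening the
        # input sequence and inflating attention/KV-cache cost for every layer.
        inputs = torch.cat([visual_tokens, text_tokens], dim=1)       # (B, N_v + N_t, D)
        return self.llm(inputs_embeds=inputs)
```

This sketch only illustrates why the input sequence grows with the number of visual tokens; DeepStack's alternative of distributing visual tokens across deeper layers is not shown here.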