Ask a Question

Prefer a chat interface with context about you and your work?

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input …