LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle to comprehend dense textual content embedded within images, primarily because of their limited text recognition and layout understanding abilities. To understand the sources of these limitations, we perform an exploratory …