Harnessing Webpage UIs for Text-Rich Visual Understanding

Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct …