CompCap: Improving Multimodal Large Language Models with Composite
Captions
How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused …