Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMMs or construct them from internet images or human annotation. We propose to leverage off-the-shelf visual specialists, which were trained on annotated images initially …