Descriptive Caption Enhancement with Visual Specialists for Multimodal
Perception
Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMMs or construct them from internet images or through human annotation. We propose to leverage off-the-shelf visual specialists, which were trained on annotated images initially …