Revisit Large-Scale Image-Caption Data in Pre-training Multimodal
Foundation Models
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original …