Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original …