Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning

Although image captioning models have made significant advances in recent years, most of them depend heavily on high-quality datasets of paired images and texts, which are costly to acquire. Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings. However, …