Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
Although image captioning models have made significant advances in recent years, most of them depend heavily on high-quality datasets of paired images and texts, which are costly to acquire. Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings. However, …