TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic
Vision-Language Negatives
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between textual and visual modalities to learn representations. As a result, the nature of the training data strongly influences the efficacy of CLIP on downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning …