AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
CLIP (Contrastive Language–Image Pretraining) is an English multimodal representation model trained on a massive collection of English text–image pairs, and it has achieved great success in various downstream tasks, including image classification, text-to-image retrieval, and image generation. When extending CLIP to other languages, the major problem is the lack of good-quality …