VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval
Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only incurs a large computational burden from the many added parameters, but also causes the knowledge of the upstream model to be forgotten. In this work, we propose VoP: Text-Video Cooperative Prompt …