3VL: Using Trees to Improve Vision-Language Models’ Interpretability
Vision-Language models (VLMs) have proven effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, …