3VL: Using Trees to Improve Vision-Language Models’ Interpretability

Vision-Language Models (VLMs) have proven effective at aligning image and text representations, producing strong zero-shot results when transferred to many downstream tasks. However, these representations have key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and the relations between different objects. Moreover, …