LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

Visual Spatial Description (VSD) aims to generate text that describes the spatial relationships between objects in an image. Traditional visual spatial relationship classification (VSRC) methods typically output only the spatial relationship between two objects in an image, often neglecting world knowledge and lacking general language capabilities. In this paper, we propose a …