LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description
Visual Spatial Description (VSD) aims to generate text that describes the spatial relationships between objects in an image. Traditional visual spatial relationship classification (VSRC) methods typically output only the spatial relationship between two objects in an image, often neglecting world knowledge and lacking general language capabilities. In this paper, we propose a …