Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network
We consider the problem of referring segmentation in images and videos with natural language. Given an input image (or video) and a referring expression, the goal is to segment the entity that the expression refers to in the image or video. In this paper, we propose a cross-modal self-attention (CMSA) module …