Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Text-based video segmentation aims to segment the target object in a video given a sentence that describes it. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial, yet it has been largely ignored by previous work. In this paper, we design a method to fuse and align …