Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging because it requires exploring both the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect …