Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging because it requires exploring both the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities, and neglect …