Spatial-then-Temporal Self-Supervised Learning for Video Correspondence
In low-level video analysis, effective representations are essential for deriving correspondences between video frames. In recent studies, such representations have been learned in a self-supervised fashion from unlabeled images or videos, using carefully designed pretext tasks. However, previous work concentrates on either spatial-discriminative features or temporal-repetitive features, …