Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos
We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., dressing) to the action (e.g., mix yogurt) that produced it. The key challenge is the inevitable visual-linguistic ambiguities arising from the changes in both visual appearance and referring expression …