Memory-Attended Recurrent Network for Video Captioning
Typical techniques for video captioning follow the encoder-decoder framework, which can only focus on the one source video being processed. A potential disadvantage of such a design is that it cannot capture the multiple visual contexts of a word that appears in more than one relevant video in the training data. To tackle …