Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation
We present our submission to the Microsoft Video to Language Challenge of generating short captions describing videos in the challenge dataset. Our model is based on the encoder--decoder pipeline, popular in image and video captioning systems. We propose to utilize two different kinds of video features, one to capture the …