Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning
Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning
Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recursive Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are …