Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network
In this paper, we propose to guide video caption generation with Part-of-Speech (POS) information, based on a gated fusion of multiple representations of the input video. We construct a novel gated fusion network, with a specially designed cross-gating (CG) block, to effectively encode and fuse different types of representations, e.g., …
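To make the idea of cross-gating concrete, the following is a minimal sketch and not the formulation used in this paper. It assumes two generic feature vectors of equal length (`feat_a` and `feat_b`, hypothetical names), a gate computed as a sigmoid of a linear projection of the other representation, and fusion by concatenation. The weight matrices `w_ab` and `w_ba` and all function names are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector v (cols)."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def cross_gate(target: Vector, driver: Vector, w: Matrix) -> Vector:
    """Scale each element of `target` by a gate computed from `driver`.

    Assumption: gate = sigmoid(w @ driver), applied elementwise.
    """
    gate = [sigmoid(z) for z in matvec(w, driver)]
    return [t * g for t, g in zip(target, gate)]


def gated_fusion(feat_a: Vector, feat_b: Vector,
                 w_ab: Matrix, w_ba: Matrix) -> Vector:
    """Gate each representation by the other, then concatenate.

    Assumption: concatenation is used as the fusion step.
    """
    a_gated = cross_gate(feat_a, feat_b, w_ba)  # feat_b gates feat_a
    b_gated = cross_gate(feat_b, feat_a, w_ab)  # feat_a gates feat_b
    return a_gated + b_gated


# With all-zero weights every gate equals sigmoid(0) = 0.5,
# so each feature is simply halved before concatenation.
zeros = [[0.0, 0.0], [0.0, 0.0]]
fused = gated_fusion([2.0, 4.0], [6.0, 8.0], zeros, zeros)
print(fused)  # -> [1.0, 2.0, 3.0, 4.0]
```

In this sketch each representation decides how much of the other passes through, which is one plausible reading of "cross-gating". A learned model would train `w_ab` and `w_ba` rather than fix them.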