Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as the and of. Other words that may seem …