Multimodal Convolutional Neural Networks for Matching Image and Sentence
Multimodal Convolutional Neural Networks for Matching Image and Sentence
In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content …