+
|
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
|
2021
|
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
|
3
|
+
|
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
|
2019
|
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Šivic
|
3
|
+
|
Decoupled Weight Decay Regularization
|
2017
|
Ilya Loshchilov
Frank Hutter
|
2
|
+
|
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
|
2018
|
Antoine Miech
Ivan Laptev
Josef Šivic
|
2
|
+
|
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
|
2020
|
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
|
2
|
+
|
Deep Residual Learning for Image Recognition
|
2016
|
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
|
2
|
+
|
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
|
2018
|
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Murphy
|
2
|
+
|
End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering
|
2017
|
Youngjae Yu
Hyungjin Ko
Jongwook Choi
Gunhee Kim
|
2
|
+
|
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
|
2019
|
Yang Liu
Samuel Albanie
Arsha Nagrani
Andrew Zisserman
|
2
|
+
|
Support-set bottlenecks for video-text representation learning
|
2020
|
Mandela Patrick
Po-Yao Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João F. Henriques
Andrea Vedaldi
|
2
|
+
|
Is Space-Time Attention All You Need for Video Understanding?
|
2021
|
Gedas Bertasius
Heng Wang
Lorenzo Torresani
|
2
|
+
|
Image Super-Resolution Via Iterative Refinement
|
2022
|
Chitwan Saharia
Jonathan Ho
William Chan
Tim Salimans
David J. Fleet
Mohammad Norouzi
|
2
|
+
|
A Style-Based Generator Architecture for Generative Adversarial Networks
|
2019
|
Tero Karras
Samuli Laine
Timo Aila
|
2
|
+
|
VideoBERT: A Joint Model for Video and Language Representation Learning
|
2019
|
Chen Sun
Austin Myers
Carl Vondrick
Kevin Murphy
Cordelia Schmid
|
2
|
+
|
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
|
2021
|
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Mohit Bansal
Jingjing Liu
|
2
|
+
|
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
|
2021
|
Weihao Xia
Yujiu Yang
Jing-Hao Xue
Baoyuan Wu
|
2
|
+
|
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
|
2018
|
Youngjae Yu
Jong-Seok Kim
Gunhee Kim
|
2
|
+
|
Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation
|
2021
|
Elad Richardson
Yuval Alaluf
Or Patashnik
Yotam Nitzan
Yaniv Azar
Stav Shapiro
Daniel Cohen-Or
|
2
|
+
|
Hierarchical Conditional Relation Networks for Video Question Answering
|
2020
|
Thao Minh Le
Vuong Le
Svetha Venkatesh
Truyen Tran
|
2
|
+
|
ActBERT: Learning Global-Local Video-Text Representations
|
2020
|
Linchao Zhu
Yi Yang
|
2
|
+
|
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
|
2021
|
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
|
2
|
+
|
Motion-Appearance Co-memory Networks for Video Question Answering
|
2018
|
Jiyang Gao
Runzhou Ge
Kan Chen
Ram Nevatia
|
2
|
+
|
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
|
2020
|
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Šivic
Andrew Zisserman
|
2
|
+
|
Localizing Moments in Video with Natural Language
|
2017
|
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Šivic
Trevor Darrell
Bryan Russell
|
2
|
+
|
Multi-modal Transformer for Video Retrieval
|
2020
|
Valentin Gabeur
Chen Sun
Karteek Alahari
Cordelia Schmid
|
2
|
+
|
SlowFast Networks for Video Recognition
|
2019
|
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
|
2
|
+
|
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
|
2017
|
Yunseok Jang
Yale Song
Youngjae Yu
Youngjin Kim
Gunhee Kim
|
2
|
+
|
UNITER: UNiversal Image-TExt Representation Learning
|
2020
|
Yen-Chun Chen
Linjie Li
Licheng Yu
Ahmed El Kholy
Faisal Ahmed
Zhe Gan
Yu Cheng
Jingjing Liu
|
2
|
+
|
Cross-Modal and Hierarchical Modeling of Video and Text
|
2018
|
Bowen Zhang
Hexiang Hu
Fei Sha
|
2
|
+
|
Learning Spatiotemporal Features with 3D Convolutional Networks
|
2015
|
Du Tran
Lubomir Bourdev
Rob Fergus
Lorenzo Torresani
Manohar Paluri
|
2
|
+
|
YouTube-8M: A Large-Scale Video Classification Benchmark
|
2016
|
Sami Abu-El-Haija
Nisarg Kothari
Joonseok Lee
Apostol Natsev
George Toderici
Balakrishnan Varadarajan
Sudheendra Vijayanarasimhan
|
2
|
+
|
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
|
2019
|
Zhenzhong Lan
Mingda Chen
Sebastian Goodman
Kevin Gimpel
Piyush Sharma
Radu Soricut
|
1
|
+
|
What Makes A Good Story? Designing Composite Rewards for Visual Storytelling
|
2020
|
Junjie Hu
Yu Cheng
Zhe Gan
Jingjing Liu
Jianfeng Gao
Graham Neubig
|
1
|
+
|
Local Aggregation for Unsupervised Learning of Visual Embeddings
|
2019
|
Chengxu Zhuang
Alex Zhai
Daniel Yamins
|
1
|
+
|
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
|
2020
|
Di Qi
Lin Su
Jia Song
Edward Cui
Taroon Bharti
Arun Sacheti
|
1
|
+
|
UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
|
2020
|
Huaishao Luo
Lei Ji
Botian Shi
Haoyang Huang
Nan Duan
Tianrui Li
Xilin Chen
Ming Zhou
|
1
|
+
|
REALM: Retrieval-Augmented Language Model Pre-Training
|
2020
|
Kelvin Guu
Kenton Lee
Zora Tung
Panupong Pasupat
Ming-Wei Chang
|
1
|
+
|
XGPT: Cross-modal Generative Pre-Training for Image Captioning
|
2020
|
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
|
1
|
+
|
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
|
2020
|
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
|
1
|
+
|
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
|
2020
|
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
Lei Zhang
Lijuan Wang
Houdong Hu
Dong Li
Furu Wei
|
1
|
+
|
Language Models are Few-Shot Learners
|
2020
|
T. B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
Prafulla Dhariwal
Arvind Neelakantan
Pranav Shyam
Girish Sastry
Amanda Askell
|
1
|
+
|
A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation
|
2020
|
Anyi Rao
Linning Xu
Yu Xiong
Guodong Xu
Qingqiu Huang
Bolei Zhou
Dahua Lin
|
1
|
+
|
VirTex: Learning Visual Representations from Textual Annotations
|
2020
|
Karan Desai
Justin Johnson
|
1
|
+
|
Momentum Contrast for Unsupervised Visual Representation Learning
|
2020
|
Kaiming He
Haoqi Fan
Yuxin Wu
Saining Xie
Ross Girshick
|
1
|
+
|
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
|
2020
|
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
|
1
|
+
|
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
|
2020
|
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
|
1
|
+
|
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
|
2020
|
Nicola Messina
Giuseppe Amato
Andrea Esuli
Fabrizio Falchi
Claudio Gennaro
Stéphane Marchand-Maillet
|
1
|
+
|
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
|
2020
|
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
|
1
|
+
|
Emerging Trends of Multimodal Research in Vision and Language
|
2020
|
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumder
Soujanya Poria
Roger Zimmermann
Amir Zadeh
|
1
|