+
|
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
|
2020
|
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
|
3
|
+
|
Deep Residual Learning for Image Recognition
|
2016
|
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
|
3
|
+
|
A Style-Based Generator Architecture for Generative Adversarial Networks
|
2019
|
Tero Karras
Samuli Laine
Timo Aila
|
3
|
+
|
Is Space-Time Attention All You Need for Video Understanding?
|
2021
|
Gedas Bertasius
Heng Wang
Lorenzo Torresani
|
2
|
+
|
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
|
2021
|
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
|
2
|
+
|
Decoupled Weight Decay Regularization
|
2017
|
Ilya Loshchilov
Frank Hutter
|
2
|
+
|
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
|
2020
|
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
|
2
|
+
|
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
|
2019
|
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Šivic
|
2
|
+
|
Hierarchical Conditional Relation Networks for Video Question Answering
|
2020
|
Thao Minh Le
Vuong Le
Svetha Venkatesh
Truyen Tran
|
2
|
+
|
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
|
2020
|
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
Thomas Unterthiner
Mostafa Dehghani
Matthias Minderer
Georg Heigold
Sylvain Gelly
|
2
|
+
|
Image Super-Resolution Via Iterative Refinement
|
2022
|
Chitwan Saharia
Jonathan Ho
William Chan
Tim Salimans
David J. Fleet
Mohammad Norouzi
|
2
|
+
|
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
|
2021
|
Weihao Xia
Yujiu Yang
Jing-Hao Xue
Baoyuan Wu
|
2
|
+
|
End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering
|
2017
|
Youngjae Yu
Hyungjin Ko
Jongwook Choi
Gunhee Kim
|
2
|
+
|
ActBERT: Learning Global-Local Video-Text Representations
|
2020
|
Linchao Zhu
Yi Yang
|
2
|
+
|
Motion-Appearance Co-memory Networks for Video Question Answering
|
2018
|
Jiyang Gao
Runzhou Ge
Kan Chen
Ram Nevatia
|
2
|
+
|
YouTube-8M: A Large-Scale Video Classification Benchmark
|
2016
|
Sami Abu-El-Haija
Nisarg Kothari
Joonseok Lee
Apostol Natsev
George Toderici
Balakrishnan Varadarajan
Sudheendra Vijayanarasimhan
|
2
|
+
|
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
|
2019
|
Yang Liu
Samuel Albanie
Arsha Nagrani
Andrew Zisserman
|
2
|
+
|
VideoBERT: A Joint Model for Video and Language Representation Learning
|
2019
|
Chen Sun
Austin Myers
Carl Vondrick
Kevin Murphy
Cordelia Schmid
|
2
|
+
|
Unified Vision-Language Pre-Training for Image Captioning and VQA
|
2020
|
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
|
2
|
+
|
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
|
2020
|
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
Lei Zhang
Lijuan Wang
Houdong Hu
Dong Li
Furu Wei
|
2
|
+
|
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
|
2020
|
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Šivic
Andrew Zisserman
|
2
|
+
|
UNITER: UNiversal Image-TExt Representation Learning
|
2020
|
Yen-Chun Chen
Linjie Li
Licheng Yu
Ahmed El Kholy
Faisal Ahmed
Zhe Gan
Yu Cheng
Jingjing Liu
|
2
|
+
|
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
|
2018
|
Richard Zhang
Phillip Isola
Alexei A. Efros
Eli Shechtman
Oliver Wang
|
2
|
+
|
Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation
|
2021
|
Elad Richardson
Yuval Alaluf
Or Patashnik
Yotam Nitzan
Yaniv Azar
Stav Shapiro
Daniel Cohen-Or
|
2
|
+
|
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
|
2018
|
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Jay Gould
Lei Zhang
|
2
|
+
|
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
|
2021
|
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
|
2
|
+
|
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
|
2017
|
Yunseok Jang
Yale Song
Youngjae Yu
Youngjin Kim
Gunhee Kim
|
2
|
+
|
Multi-modal Transformer for Video Retrieval
|
2020
|
Valentin Gabeur
Chen Sun
Karteek Alahari
Cordelia Schmid
|
2
|
+
|
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
|
2018
|
Antoine Miech
Ivan Laptev
Josef Šivic
|
2
|
+
|
In Defense of Grid Features for Visual Question Answering
|
2020
|
Huaizu Jiang
Ishan Misra
Marcus Rohrbach
Erik Learned-Miller
Xinlei Chen
|
2
|
+
|
Localizing Moments in Video with Natural Language
|
2017
|
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Šivic
Trevor Darrell
Bryan Russell
|
2
|
+
|
SlowFast Networks for Video Recognition
|
2019
|
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
|
2
|
+
|
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
|
2018
|
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Murphy
|
2
|
+
|
Learning Spatiotemporal Features with 3D Convolutional Networks
|
2015
|
Du Tran
Lubomir Bourdev
Rob Fergus
Lorenzo Torresani
Manohar Paluri
|
2
|
+
|
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
|
2017
|
Martin Heusel
Hubert Ramsauer
Thomas Unterthiner
Bernhard Nessler
Sepp Hochreiter
|
2
|
+
|
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
|
2018
|
Youngjae Yu
Jong-Seok Kim
Gunhee Kim
|
2
|
+
|
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
|
2019
|
Hao Tan
Mohit Bansal
|
2
|
+
|
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
|
2018
|
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
|
2
|
+
|
Support-set bottlenecks for video-text representation learning
|
2020
|
Mandela Patrick
Po-Yao Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João F. Henriques
Andrea Vedaldi
|
2
|
+
|
Cross-Modal and Hierarchical Modeling of Video and Text
|
2018
|
Bowen Zhang
Hexiang Hu
Fei Sha
|
2
|
+
|
Image Inpainting for Irregular Holes Using Partial Convolutions
|
2018
|
Guilin Liu
Fitsum A. Reda
Kevin J. Shih
Ting-Chun Wang
Andrew Tao
Bryan Catanzaro
|
1
|
+
|
SPICE: Semantic Propositional Image Caption Evaluation
|
2016
|
Peter Anderson
Basura Fernando
Mark Johnson
Stephen Jay Gould
|
1
|
+
|
Towards Automatic Learning of Procedures from Web Instructional Videos
|
2017
|
Luowei Zhou
Chenliang Xu
Jason J. Corso
|
1
|
+
|
A Note on the Inception Score
|
2018
|
Shane Barratt
Rishi Sharma
|
1
|
+
|
Improved Techniques for Training GANs
|
2016
|
Tim Salimans
Ian Goodfellow
Wojciech Zaremba
Vicki Cheung
Alec Radford
Xi Chen
|
1
|
+
|
Going deeper with convolutions
|
2015
|
Christian Szegedy
Wei Liu
Yangqing Jia
Pierre Sermanet
Scott Reed
Dragomir Anguelov
Dumitru Erhan
Vincent Vanhoucke
Andrew Rabinovich
|
1
|
+
|
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
|
2017
|
Fartash Faghri
David J. Fleet
Jamie Kiros
Sanja Fidler
|
1
|
+
|
Conditional Image Generation with PixelCNN Decoders
|
2016
|
Aäron van den Oord
Nal Kalchbrenner
Oriol Vinyals
Lasse Espeholt
Alex Graves
Koray Kavukcuoglu
|
1
|
+
|
Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space
|
2016
|
Anh-Tu Nguyen
Jeff Clune
Yoshua Bengio
Alexey Dosovitskiy
Jason Yosinski
|
1
|
+
|
Dense-Captioning Events in Videos
|
2017
|
Ranjay Krishna
Kenji Hata
Frederic Ren
Li Fei-Fei
Juan Carlos Niebles
|
1
|