Video Action Transformer Network

Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

Type: Article

Publication Date: 2019-06-01

Citations: 707

DOI: https://doi.org/10.1109/cvpr.2019.00033

Abstract

We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action, all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state of the art by a significant margin using only raw RGB frames as input.
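
The abstract describes a person-specific query that aggregates features from the surrounding spatiotemporal context. The abstract itself gives no equations, so the following is only a minimal sketch assuming standard single-head scaled dot-product attention. The names `attend` and `person_query`, the toy dimensions, and the random features (standing in for learned projections of video features) are all illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one person query (illustrative).

    query:  list of D floats, a hypothetical feature for one person box
    keys:   N lists of D floats, one per spatiotemporal context location
    values: N lists of D floats, one per spatiotemporal context location
    Returns (aggregated context feature of length D, N attention weights).
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    context = [
        sum(w * value[d] for w, value in zip(weights, values))
        for d in range(dim)
    ]
    return context, weights


# Toy clip (assumed sizes): T frames of H x W locations, D-dim features.
random.seed(0)
T, H, W, D = 2, 3, 3, 4
num_locations = T * H * W
keys = [[random.gauss(0, 1) for _ in range(D)] for _ in range(num_locations)]
values = [[random.gauss(0, 1) for _ in range(D)] for _ in range(num_locations)]
person_query = [random.gauss(0, 1) for _ in range(D)]

context, weights = attend(person_query, keys, values)
```

In this sketch, `weights` holds one non-negative value per context location and sums to 1, so reshaping it to T x H x W would give the kind of attention map over the clip that the abstract refers to when it mentions emphasis on hands and faces.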

Locations

arXiv (Cornell University)
Oxford University Research Archive (ORA) (University of Oxford)
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Similar Works

Title	Year	Authors
Video Action Transformer Network	2018	Rohit Girdhar João Carreira Carl Doersch Andrew Zisserman
Object-Region Video Transformers	2021	Roei Herzig Elad Ben-Avraham Karttikeya Mangalam Amir Bar Gal Chechik Anna Rohrbach Trevor Darrell Amir Globerson
Object-Region Video Transformers	2022	Roei Herzig Elad Ben-Avraham Karttikeya Mangalam Amir Bar Gal Chechik Anna Rohrbach Trevor Darrell Amir Globerson
Vision Transformers for Action Recognition: A Survey	2022	Anwaar Ulhaq Naveed Akhtar Ganna Pogrebna Ajmal Mian
Deformable Video Transformer	2022	Jue Wang Lorenzo Torresani
ActionFormer: Localizing Moments of Actions with Transformers	2022	Chenlin Zhang Jianxin Wu Yin Li
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention	2020	Juan-Manuel Pérez-Rúa Brais Martínez Xiatian Zhu Antoine Toisoul Víctor Escorcia Tao Xiang
VideoLightFormer: Lightweight Action Recognition using Transformers	2021	Raivo E. Koot Haiping Lu
GTA: Global Temporal Attention for Video Action Understanding	2020	Bo He Xitong Yang Zuxuan Wu Hao Chen Ser-Nam Lim Abhinav Shrivastava
Hierarchical Attention Network for Action Recognition in Videos	2016	Yilin Wang Suhang Wang Jiliang Tang Neil O’Hare Yi Chang Baoxin Li
Lightweight Network Architecture for Real-Time Action Recognition	2019	Alexander Kozlov Vadim Andronov Yana Gritsenko
Lightweight network architecture for real-time action recognition	2020	Alexander Kozlov Vadim Andronov Yana Gritsenko
End-to-End Temporal Action Detection With Transformer	2022	Xiaolong Liu Qimeng Wang Yao Hu Xu Tang Shiwei Zhang Song Bai Xiang Bai
Knowledge Fusion Transformers for Video Action Recognition	2020	Ganesh Samarth Sheetal Ojha Nikhil Pareek
Evaluating Transformers for Lightweight Action Recognition	2021	Raivo Koot Markus Hennerbichler Haiping Lu

Works That Cite This (350)

Title	Year	Authors
Lightweight Delivery Detection on Doorbell Cameras	2024	Pirazh Khorramshahi Zhe Wu Tianchen Wang Luke DeLuccia Hongcheng Wang
Revisiting spatio-temporal layouts for compositional action recognition.	2021	Gorjan Radevski Marie‐Francine Moens Tinne Tuytelaars
Transformers for prompt-level EMA non-response prediction.	2021	Supriya Nagesh Alexander Moreno Stephanie M. Carpenter Jamie Yap Soujanya Chatterjee Steven Lloyd Lizotte Neng Wan Santosh Kumar Cho Y. Lam David W. Wetter
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition	2021	Evangelos Kazakos Jaesung Huh Arsha Nagrani Andrew Zisserman Dima Damen
TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval	2022	Yongbiao Chen Sheng Zhang Fangxin Liu Zhigang Chang Mang Ye Zhengwei Qi
(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering	2022	Anoop Cherian Chiori Hori Tim K. Marks Jonathan Le Roux
Holistic Interaction Transformer Network for Action Detection	2023	Gueter Josmy Faure Min-Hung Chen Shang‐Hong Lai
FAR: Fourier Aerial Video Recognition	2022	Divya Kothandaraman Tianrui Guan Xijun Wang Shuowen Hu Ming C. Lin Dinesh Manocha
Prompt Guided Transformer for Multi-Task Dense Prediction	2024	Yuxiang Lu Shalayiding Sirejiding Yue Ding Chunlin Wang Hongtao Lu
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering	2023	Yang Liu Guanbin Li Liang Lin

Works Cited by This (39)

Title	Year	Authors
Deep Residual Learning for Image Recognition	2016	Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding	2016	Gunnar A. Sigurdsson Gül Varol Xiaolong Wang Ali Farhadi Ivan Laptev Abhinav Gupta
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild	2012	Khurram Soomro Amir Zamir Mubarak Shah
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition	2016	Limin Wang Yuanjun Xiong Zhe Wang Yu Qiao Dahua Lin Xiaoou Tang Luc Van Gool
YouTube-8M: A Large-Scale Video Classification Benchmark	2016	Sami Abu-El-Haija Nisarg Kothari Joonseok Lee Apostol Natsev George Toderici Balakrishnan Varadarajan Sudheendra Vijayanarasimhan
Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors	2017	Jonathan Huang Vivek Rathod Chen Sun Menglong Zhu Anoop Korattikara Alireza Fathi Ian Fischer Zbigniew Wojna Yang Song Sergio Guadarrama
Feature Pyramid Networks for Object Detection	2017	Tsung-Yi Lin Piotr Dollár Ross Girshick Kaiming He Bharath Hariharan Serge Belongie
Asynchronous Temporal Fields for Action Recognition	2017	Gunnar A. Sigurdsson Santosh Divvala Ali Farhadi Abhinav Gupta
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification	2017	Rohit Girdhar Deva Ramanan Abhinav Gupta Josef Šivic Bryan Russell
Action Tubelet Detector for Spatio-Temporal Action Localization	2017	Vicky Kalogeiton Philippe Weinzaepfel Vittorio Ferrari Cordelia Schmid