Video Action Transformer Network

Type: Article

Publication Date: 2019-06-01

Citations: 707

DOI: https://doi.org/10.1109/cvpr.2019.00033

Abstract

We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action, all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state of the art by a significant margin using only raw RGB frames as input.
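The aggregation step described in the abstract can be illustrated with a minimal sketch of single-head scaled dot-product attention, in which one person-specific query attends over features from the surrounding spatiotemporal context. This is not the authors' implementation: the function names (`softmax`, `attend`), the toy 2-D queries and keys, and the 3-D values are all assumptions chosen for illustration, and details such as learned projections, multiple heads, and stacked layers are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(person_query, context_keys, context_values):
    """Aggregate context features for one person query.

    person_query   : list[float], feature of the person being classified
    context_keys   : list[list[float]], one key per spatiotemporal location
    context_values : list[list[float]], one value per spatiotemporal location
    Returns (pooled_feature, attention_weights).
    """
    dim = len(person_query)
    # Similarity between the person query and every context location,
    # scaled by sqrt(dim) as in standard Transformer attention.
    scores = [
        sum(q * k for q, k in zip(person_query, key)) / math.sqrt(dim)
        for key in context_keys
    ]
    weights = softmax(scores)
    # Weighted sum of the context values.
    value_dim = len(context_values[0])
    pooled = [
        sum(w * value[j] for w, value in zip(weights, context_values))
        for j in range(value_dim)
    ]
    return pooled, weights


# Toy example (hypothetical numbers): three context locations.
person_query = [1.0, 0.0]
context_keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
context_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

pooled, weights = attend(person_query, context_keys, context_values)
print(weights)  # the first location, most aligned with the query, dominates
```

In this toy setup the values are one-hot, so the pooled feature simply equals the attention weights, which makes it easy to see which context locations the query emphasizes.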

Locations

  • arXiv (Cornell University)
  • Oxford University Research Archive (ORA) (University of Oxford)
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Similar Works

  • Video Action Transformer Network (2018). Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
  • Object-Region Video Transformers (2021). Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
  • Object-Region Video Transformers (2022). Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
  • Vision Transformers for Action Recognition: A Survey (2022). Anwaar Ulhaq, Naveed Akhtar, Ganna Pogrebna, Ajmal Mian
  • Deformable Video Transformer (2022). Jue Wang, Lorenzo Torresani
  • ActionFormer: Localizing Moments of Actions with Transformers (2022). Chenlin Zhang, Jianxin Wu, Yin Li
  • Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention (2020). Juan-Manuel Pérez-Rúa, Brais Martínez, Xiatian Zhu, Antoine Toisoul, Víctor Escorcia, Tao Xiang
  • VideoLightFormer: Lightweight Action Recognition using Transformers (2021). Raivo E. Koot, Haiping Lu
  • GTA: Global Temporal Attention for Video Action Understanding (2020). Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava
  • Hierarchical Attention Network for Action Recognition in Videos (2016). Yilin Wang, Suhang Wang, Jiliang Tang, Neil O’Hare, Yi Chang, Baoxin Li
  • Lightweight Network Architecture for Real-Time Action Recognition (2019). Alexander Kozlov, Vadim Andronov, Yana Gritsenko
  • Lightweight network architecture for real-time action recognition (2020). Alexander Kozlov, Vadim Andronov, Yana Gritsenko
  • End-to-End Temporal Action Detection With Transformer (2022). Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, Xiang Bai
  • Knowledge Fusion Transformers for Video Action Recognition (2020). Ganesh Samarth, Sheetal Ojha, Nikhil Pareek
  • Evaluating Transformers for Lightweight Action Recognition (2021). Raivo Koot, Markus Hennerbichler, Haiping Lu

Works That Cite This (350)

  • Lightweight Delivery Detection on Doorbell Cameras (2024). Pirazh Khorramshahi, Zhe Wu, Tianchen Wang, Luke DeLuccia, Hongcheng Wang
  • Revisiting spatio-temporal layouts for compositional action recognition (2021). Gorjan Radevski, Marie‐Francine Moens, Tinne Tuytelaars
  • Transformers for prompt-level EMA non-response prediction (2021). Supriya Nagesh, Alexander Moreno, Stephanie M. Carpenter, Jamie Yap, Soujanya Chatterjee, Steven Lloyd Lizotte, Neng Wan, Santosh Kumar, Cho Y. Lam, David W. Wetter
  • With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition (2021). Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
  • TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval (2022). Yongbiao Chen, Sheng Zhang, Fangxin Liu, Zhigang Chang, Mang Ye, Zhengwei Qi
  • (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering (2022). Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux
  • Holistic Interaction Transformer Network for Action Detection (2023). Gueter Josmy Faure, Min-Hung Chen, Shang‐Hong Lai
  • FAR: Fourier Aerial Video Recognition (2022). Divya Kothandaraman, Tianrui Guan, Xijun Wang, Shuowen Hu, Ming C. Lin, Dinesh Manocha
  • Prompt Guided Transformer for Multi-Task Dense Prediction (2024). Yuxiang Lu, Shalayiding Sirejiding, Yue Ding, Chunlin Wang, Hongtao Lu
  • Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering (2023). Yang Liu, Guanbin Li, Liang Lin

Works Cited by This (39)

  • Deep Residual Learning for Image Recognition (2016). Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
  • Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (2016). Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, Abhinav Gupta
  • UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild (2012). Khurram Soomro, Amir Zamir, Mubarak Shah
  • Temporal Segment Networks: Towards Good Practices for Deep Action Recognition (2016). Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool
  • YouTube-8M: A Large-Scale Video Classification Benchmark (2016). Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan
  • Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors (2017). Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama
  • Feature Pyramid Networks for Object Detection (2017). Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie
  • Asynchronous Temporal Fields for Action Recognition (2017). Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta
  • ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification (2017). Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Šivic, Bryan Russell
  • Action Tubelet Detector for Spatio-Temporal Action Localization (2017). Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid