Prompting Visual-Language Models for Efficient Video Understanding

Type: Book-Chapter

Publication Date: 2022-01-01

Citations: 187

DOI: https://doi.org/10.1007/978-3-031-19833-5_7

Locations

  • Lecture Notes in Computer Science
  • arXiv (Cornell University)

Similar Works

Title Year Authors
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives 2024 Thông Nguyen
Bin Yi
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding 2024 Hengchao Zou
T. Luo
Gen-Qing Xie
Victor Zhang
Fangfang Lv
G. Q. Wang
J. Chen
Zhuochen Wang
Hai‐Feng Zhang
PruneVid: Visual Token Pruning for Efficient Video Large Language Models 2024 Xiaohu Huang
Hao Zhou
K. L. Han
Video Understanding with Large Language Models: A Survey 2023 Yun‐Long Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
Teng Wang
Daoan Zhang
Jie An
Jingyang Lin
Rongyi Zhu
Semantically-Prompted Language Models Improve Visual Descriptions 2024 Michael Ogezi
Bradley Hauer
Grzegorz Kondrak
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding 2021 Hu Xu
Gargi Ghosh
Po-Yao Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding 2023 Shuhuai Ren
Linli Yao
Shicheng Li
Xu Sun
Lu Hou
LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs 2024 Yunxin Li
Xinyu Chen
Baotian Hu
Min Zhang
Unifying Specialized Visual Encoders for Video Language Models 2025 J.-Y. Chung
Tyler Zhu
Max Gonzalez Saez-Diez
Juan Carlos Niebles
Honglu Zhou
Olga Russakovsky
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos 2024 Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Mingchen Zhuge
Jian Ding
Deyao Zhu
Jürgen Schmidhuber
Mohamed Elhoseiny
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models 2024 Jinhui Yi
Syed Talal Wasim
Yan-An Luo
Muzammal Naseer
Jürgen Gall
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding 2023 Yizhou Wang
Ruiyi Zhang
Haoliang Wang
Uttaran Bhattacharya
Yun Fu
Gang Wu
RTQ: Rethinking Video-language Understanding Based on Image-text Model 2023 Xiao Wang
Yaoyu Li
Tian Gan
Zheng Zhang
Jingjing Lv
Liqiang Nie
Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input 2024 Jiajun Liu
Yibing Wang
Hanghang Ma
Xiaoping Wu
Xiaoqi Ma
Xiaoming Wei
Jianbin Jiao
Enhua Wu
Jie Hu
CogVLM2: Visual Language Models for Image and Video Understanding 2024 Wenyi Hong
Weihan Wang
Ming Ding
Wenmeng Yu
Qingsong Lv
Yan Wang
Yean Cheng
Shiyu Huang
Junhui Ji
Xue Zhao
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens 2024 Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Deyao Zhu
Jian Ding
Mohamed Elhoseiny
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models 2023 Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
3VL: Using Trees to Improve Vision-Language Models’ Interpretability 2025 Nir Yellinek
Leonid Karlinsky
Raja Giryes
Retrieval-based Video Language Model for Efficient Long Video Question Answering 2023 Jiaqi Xu
Cuiling Lan
Wenxuan Xie
Xuejin Chen
Yan Lu
Visual Context Window Extension: A New Perspective for Long Video Understanding 2024 Hu Wei
Chen Zhen-zhong

Works That Cite This (77)

Title Year Authors
VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval 2023 Siteng Huang
Biao Gong
Yulin Pan
Jianwen Jiang
Yiliang Lv
Yuyuan Li
Donglin Wang
ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition 2023 Soumyabrata Chaudhuri
Saumik Bhattacharya
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment 2024 Hao Fei
Shengqiong Wu
Meishan Zhang
Min Zhang
Tat‐Seng Chua
Shuicheng Yan
MaPLe: Multi-modal Prompt Learning 2023 Muhammad Uzair Khattak
Hanoona Rasheed
Muhammad Maaz
Salman Khan
Fahad Shahbaz Khan
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation 2023 Haowei Wang
Jiji Tang
Jiayi Ji
Xiaoshuai Sun
Rongsheng Zhang
Yiwei Ma
Minda Zhao
Lincheng Li
Zeng Zhao
Tangjie Lv
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models 2022 Yuan Yao
Qianyu Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat‐Seng Chua
Maosong Sun
VideoXum: Cross-Modal Visual and Textural Summarization of Videos 2023 Jingyang Lin
Hang Hua
Ming Chen
Yikang Li
Jen-Hao Hsiao
Chiuman Ho
Jiebo Luo
Event-Guided Procedure Planning from Instructional Videos with Text Supervision 2023 An-Lan Wang
Kun-Yu Lin
Jia-Run Du
Jingke Meng
Wei‐Shi Zheng
Large Language Models are Temporal and Causal Reasoners for Video Question Answering 2023 Dohwan Ko
Ji Eun Lee
Wooyoung Kang
Byungseok Roh
Hyunwoo Kim
CPT: Color-based Prompt Tuning for pre-trained vision-language models 2024 Yuan Yao
Ao Zhang
Zhengyan Zhang
Zhiyuan Liu
Tat‐Seng Chua
Maosong Sun

Works Cited by This (51)

Title Year Authors
Objects2action: Classifying and Localizing Actions without Any Video Example 2015 Mihir Jain
Jan van Gemert
Thomas Mensink
Cees G. M. Snoek
Convolutional Two-Stream Network Fusion for Video Action Recognition 2016 Christoph Feichtenhofer
Axel Pinz
Andrew Zisserman
Learning Attributes Equals Multi-Source Domain Generalization 2016 Chuang Gan
Tianbao Yang
Boqing Gong
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition 2016 Limin Wang
Yuanjun Xiong
Zhe Wang
Yu Qiao
Dahua Lin
Xiaoou Tang
Luc Van Gool
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos 2017 Zheng Shou
Jonathan Chan
Alireza Zareian
Kazuyuki Miyazawa
Shih‐Fu Chang
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification 2018 Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Murphy
A Joint Sequence Fusion Model for Video Question Answering and Retrieval 2018 Youngjae Yu
Jong-Seok Kim
Gunhee Kim
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation 2018 Tianwei Lin
Xu Zhao
Haisheng Su
Chongjing Wang
Ming Yang
Rethinking the Faster R-CNN Architecture for Temporal Action Localization 2018 Yu-Wei Chao
Sudheendra Vijayanarasimhan
Bryan Seybold
David A. Ross
Jia Deng
Rahul Sukthankar
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? 2018 Kensho Hara
Hirokatsu Kataoka
Yutaka Satoh