Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Type: Article

Publication Date: 2023-06-01

Citations: 14

DOI: https://doi.org/10.1109/cvpr52729.2023.01029

Abstract

Generating a video given the first several static frames is challenging, as it requires anticipating reasonable future frames with temporal coherence. Beyond video prediction, the ability to rewind from the last frame or to infill between the head and tail is also crucial, but these settings have rarely been explored for video completion. Since just a few hint frames can lead to different outcomes, a system that follows natural language to perform video completion can significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all three cases of TVC (video prediction, rewind, and infilling) by applying the corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
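The three TVC cases differ only in which frames are given and which are masked. As a minimal sketch of the masking conditions described above (the function name, the per-frame boolean representation, and the `num_hint` parameter are illustrative assumptions, not the paper's implementation):

```python
from typing import List

def tvc_mask(num_frames: int, case: str, num_hint: int = 1) -> List[bool]:
    """Per-frame flags for text-guided video completion.

    True  = frame is given as a hint (its tokens are kept),
    False = frame is masked and must be generated.

    case: 'prediction' (first frames given), 'rewind' (last frames given),
          'infilling' (first and last frames given).
    """
    if case == "prediction":
        return [i < num_hint for i in range(num_frames)]
    if case == "rewind":
        return [i >= num_frames - num_hint for i in range(num_frames)]
    if case == "infilling":
        return [i < num_hint or i >= num_frames - num_hint
                for i in range(num_frames)]
    raise ValueError(f"unknown TVC case: {case}")
```

For a 5-frame clip with one hint frame, `tvc_mask(5, "infilling")` keeps the head and tail frames and masks the three in between; a single trained model handles all three patterns because training masks frames from arbitrary time points.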

Locations

  • arXiv (Cornell University)
  • 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Similar Works

  • Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation (2022). Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell
  • TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation (2024). Weixi Feng, Jiachen Li, Michael Saxon, Tsu-Jui Fu, Wenhu Chen, William Yang Wang
  • VIMI: Grounding Video Generation through Multi-modal Instruction (2024). Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov
  • Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance (2024). Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang
  • FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance (2024). Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin
  • Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance (2023). Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang
  • WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (2024). Xiaofeng Wang, Zheng Zhu, Guan Huang, Bo-Yuan Wang, Xinze Chen, Jiwen Lu
  • Motion Control for Enhanced Complex Action Video Generation (2024). Qiang Zhou, Shaofeng Zhang, Nianzu Yang, Ye Qian, Hao Li
  • VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (2023). Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
  • MTVG: Multi-text Video Generation with Text-to-Video Models (2023). Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Hyeokmin Kwon, Sangpil Kim
  • DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation (2024). Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
  • VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation (2024). Chi Zhang, Liang Yu, Xintao Qiu, Fei Yi, Shichao Zhang
  • Make-A-Video: Text-to-Video Generation without Text-Video Data (2022). Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni
  • Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation (2023). Susung Hong, Junyoung Seo, Sunghwan Hong, Heeseong Shin, Seungryong Kim
  • GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration (2024). Kai-Yi Huang, Yunpeng Huang, Xuyang Ning, Zinan Lin, Yu Wang, Xihui Liu
  • Fine-grained Controllable Video Generation via Object Appearance and Context (2023). Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang
  • Make It Move: Controllable Image-to-Video Generation with Text Descriptions (2022). Yaosi Hu, Chong Luo, Zhenzhong Chen
  • Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models (2023). Rohan Dhesikan, V. Rajmohan
  • T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation (2024). Kaiyue Sun, Kaiyi Huang, Liu Xian, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu
  • Text-Animator: Controllable Visual Text Video Generation (2024). Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian