Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Type: Article

Publication Date: 2023-06-01

Citations: 14

DOI: https://doi.org/10.1109/cvpr52729.2023.01029

Abstract

Generating a video given the first several static frames is challenging, as it requires anticipating reasonable future frames with temporal coherence. Beyond video prediction, the ability to rewind from the last frame or to infill between the head and tail is also crucial, but these settings have rarely been explored for video completion. Since just a few hint frames can lead to different outcomes, a system that follows natural language to perform video completion can significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all three cases of TVC (video prediction, rewind, and infilling) by applying the corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
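The three TVC cases differ only in which frames are given and which are masked. As a minimal sketch of the masking conditions described above (the function name, the per-frame boolean representation, and the `num_hint` parameter are illustrative assumptions, not the paper's implementation):

```python
from typing import List

def tvc_mask(num_frames: int, case: str, num_hint: int = 1) -> List[bool]:
    """Per-frame flags for text-guided video completion.

    True  = frame is given as a hint (its tokens are kept),
    False = frame is masked and must be generated.

    case: 'prediction' (first frames given), 'rewind' (last frames given),
          'infilling' (first and last frames given).
    """
    if case == "prediction":
        return [i < num_hint for i in range(num_frames)]
    if case == "rewind":
        return [i >= num_frames - num_hint for i in range(num_frames)]
    if case == "infilling":
        return [i < num_hint or i >= num_frames - num_hint
                for i in range(num_frames)]
    raise ValueError(f"unknown TVC case: {case}")
```

For a 5-frame clip with one hint frame, `tvc_mask(5, "infilling")` keeps the head and tail frames and masks the three in between; a single trained model handles all three patterns because training masks frames from arbitrary time points.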

Locations

  • arXiv (Cornell University)
  • 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Similar Works

  • Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation (2022). Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell
  • TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation (2024). Weixi Feng, Jiachen Li, Michael Saxon, Tsu-Jui Fu, Wenhu Chen, William Yang Wang
  • VIMI: Grounding Video Generation through Multi-modal Instruction (2024). Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov
  • Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance (2024). Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang
  • FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance (2024). Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin
  • Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance (2023). Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang
  • WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (2024). Xiaofeng Wang, Zheng Zhu, Guan Huang, Bo-Yuan Wang, Xinze Chen, Jiwen Lu
  • Motion Control for Enhanced Complex Action Video Generation (2024). Qiang Zhou, Shaofeng Zhang, Nianzu Yang, Ye Qian, Hao Li
  • VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (2023). Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
  • MTVG: Multi-text Video Generation with Text-to-Video Models (2023). Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Hyeokmin Kwon, Sangpil Kim
  • DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation (2024). Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
  • VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation (2024). Chi Zhang, Liang Yu, Xintao Qiu, Fei Yi, Shichao Zhang
  • Make-A-Video: Text-to-Video Generation without Text-Video Data (2022). Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni
  • Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation (2023). Susung Hong, Junyoung Seo, Sunghwan Hong, Heeseong Shin, Seungryong Kim
  • GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration (2024). Kai-Yi Huang, Yunpeng Huang, Xuyang Ning, Zinan Lin, Yu Wang, Xihui Liu
  • Fine-grained Controllable Video Generation via Object Appearance and Context (2023). Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang
  • Make It Move: Controllable Image-to-Video Generation with Text Descriptions (2022). Yaosi Hu, Chong Luo, Zhenzhong Chen
  • Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models (2023). Rohan Dhesikan, V. Rajmohan
  • T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation (2024). Kaiyue Sun, Kaiyi Huang, Liu Xian, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu
  • Text-Animator: Controllable Visual Text Video Generation (2024). Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian