Self-View Grounding Given a Narrated 360° Video

Type: Article

Publication Date: 2018-04-27

Citations: 17

DOI: https://doi.org/10.1609/aaai.v32i1.12289

Abstract

Narrated 360° videos are typically provided in many touring scenarios to mimic the real-world experience. However, previous work has shown that smart assistance (i.e., providing visual guidance) can significantly help users to follow the Normal Field of View (NFoV) corresponding to the narrative. In this project, we aim at automatically grounding the NFoVs of a 360° video given subtitles of the narrative (referred to as "NFoV-grounding"). We propose a novel Visual Grounding Model (VGM) to implicitly and efficiently predict the NFoVs given the video content and subtitles. Specifically, at each frame, we efficiently encode the panorama into a feature map of candidate NFoVs using a Convolutional Neural Network (CNN) and encode the subtitles into the same hidden space using an RNN with Gated Recurrent Units (GRU). Then, we apply soft-attention over the candidate NFoVs to trigger a sentence decoder that minimizes the reconstruction loss between the generated and given sentence. Finally, we obtain the NFoV as the candidate NFoV with the maximum attention, without any human supervision. To train the VGM more robustly, we also generate a reverse sentence conditioned on one minus the soft-attention, so that the attention focuses on candidate NFoVs less relevant to the given sentence. The negative log reconstruction loss of the reverse sentence (referred to as the "irrelevant loss") is jointly minimized to encourage the reverse sentence to be different from the given sentence. To evaluate our method, we collect the first narrated 360° video dataset and achieve state-of-the-art NFoV-grounding performance.
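
As a rough illustration of the attention and loss structure described in the abstract, the sketch below implements a simplified VGM-style module in PyTorch. The module names, layer sizes, and the assumption of pre-extracted CNN features for K candidate NFoVs are illustrative choices rather than the authors' implementation, and the sign of the irrelevant loss follows one reading of the abstract.

```python
# Minimal sketch of the soft-attention + dual reconstruction-loss structure
# described in the abstract. Shapes, layer sizes, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VGMSketch(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hid_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.sent_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)  # subtitle encoder (GRU)
        self.proj_v = nn.Linear(feat_dim, hid_dim)                  # map NFoV features into the shared space
        self.att_score = nn.Linear(hid_dim, 1)                      # scores each candidate NFoV
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)   # sentence decoder
        self.out = nn.Linear(hid_dim, vocab_size)

    def attend(self, nfov_feats, sent_hidden):
        # nfov_feats: (B, K, feat_dim) CNN features of K candidate NFoVs
        # sent_hidden: (B, hid_dim) final GRU state of the subtitle
        v = torch.tanh(self.proj_v(nfov_feats) + sent_hidden.unsqueeze(1))
        return F.softmax(self.att_score(v).squeeze(-1), dim=1)      # soft-attention over candidates, (B, K)

    def reconstruct_loss(self, context, tokens):
        # Decode the given subtitle from an attended visual context and return
        # the cross-entropy (negative log-likelihood) reconstruction loss.
        emb = self.embed(tokens[:, :-1])                            # teacher forcing on the given sentence
        dec_out, _ = self.decoder(emb, context.unsqueeze(0))        # context as initial decoder state
        logits = self.out(dec_out)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))

    def forward(self, nfov_feats, tokens):
        _, h = self.sent_enc(self.embed(tokens))                    # encode the subtitle
        sent_hidden = h.squeeze(0)
        alpha = self.attend(nfov_feats, sent_hidden)                # (B, K)

        v_proj = self.proj_v(nfov_feats)
        # Relevant context: attention-weighted sum of candidate NFoV features.
        ctx = torch.bmm(alpha.unsqueeze(1), v_proj).squeeze(1)
        # Irrelevant context: weight candidates by (1 - alpha), renormalized.
        beta = (1.0 - alpha) / (1.0 - alpha).sum(dim=1, keepdim=True)
        ctx_rev = torch.bmm(beta.unsqueeze(1), v_proj).squeeze(1)

        rec_loss = self.reconstruct_loss(ctx, tokens)               # reconstruction loss (given sentence)
        # "Irrelevant loss": negative of the reverse-context reconstruction loss,
        # so minimizing it discourages the less-relevant NFoVs from reconstructing
        # the given sentence (one reading of the abstract's description).
        irr_loss = -self.reconstruct_loss(ctx_rev, tokens)
        pred_nfov = alpha.argmax(dim=1)                             # NFoV = candidate with maximum attention
        return rec_loss + irr_loss, pred_nfov
```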

Locations

  • Proceedings of the AAAI Conference on Artificial Intelligence
  • arXiv (Cornell University)

Similar Works

  • Self-view Grounding Given a Narrated 360° Video (2017). Shih-Han Chou, Yi-Chun Chen, Kuo-Hao Zeng, Hou-Ning Hu, Jianlong Fu, Min Sun
  • Learning Temporal Sentence Grounding From Narrated EgoVideos (2023). Kevin C. Flanagan, Dima Damen, Michael Wray
  • VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation (2024). Chi Zhang, Liang Yu, Xintao Qiu, Fei Yi, Shichao Zhang
  • Movie101v2: Improved Movie Narration Benchmark (2024). Zihao Yue, Zhang Yepeng, Ziheng Wang, Qin Jin
  • GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 (2023). Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, W.K. Chan, Chong‐Wah Ngo, Nan Duan, Mike Zheng Shou
  • What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions (2023). Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih‐Fu Chang, Rogério Feris, James Glass, Hilde Kuehne
  • VideoAuteur: Towards Long Narrative Video Generation (2025). Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Z Ma, Alan Yuille, Lu Jiang
  • Towards Visual Grounding: A Survey (2024). Linhui Xiao, Xiaoshan Yang, Xingyu Lan, Yaowei Wang, Changsheng Xu
  • Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos (2021). Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell
  • What You Say Is What You Show: Visual Narration Detection in Instructional Videos (2023). Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
  • Visual Storytelling via Predicting Anchor Word Embeddings in the Stories (2020). Bowen Zhang, Hexiang Hu, Fei Sha
  • Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline (2024). Dingyi Yang, Chunru Zhan, Ziheng Wang, Biao Wang, Tiezheng Ge, Bo Zheng, Qin Jin
  • STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding (2022). Zihang Lin, Chaolei Tan, Jianfang Hu, Zhi Jin, Tiancai Ye, Wei‐Shi Zheng
  • EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation (2024). Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, Xingang Wang
  • Human-centric Spatio-Temporal Video Grounding With Visual Transformers (2020). Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, Dong Xu
  • Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network (2023). Haowei Wang, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Xiaoshuai Sun
  • Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning (2024). Yingjin Song, Denis Paperno, Albert Gatt
  • Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network (2023). Yiming Lin, Xiao-Bo Jin, Qiufeng Wang, Kaizhu Huang
  • Contextualize, Show and Tell: A Neural Visual Storyteller (2018). Diana Gonzalez-Rico, Gibrán Fuentes-Pineda