Self-View Grounding Given a Narrated 360° Video

Type: Article

Publication Date: 2018-04-27

Citations: 17

DOI: https://doi.org/10.1609/aaai.v32i1.12289

Abstract

Narrated 360° videos are typically provided in many touring scenarios to mimic the real-world experience. However, previous work has shown that smart assistance (i.e., providing visual guidance) can significantly help users to follow the Normal Field of View (NFoV) corresponding to the narrative. In this project, we aim at automatically grounding the NFoVs of a 360° video given subtitles of the narrative (referred to as "NFoV-grounding"). We propose a novel Visual Grounding Model (VGM) to implicitly and efficiently predict the NFoVs given the video content and subtitles. Specifically, at each frame, we efficiently encode the panorama into a feature map of candidate NFoVs using a Convolutional Neural Network (CNN) and encode the subtitles into the same hidden space using an RNN with Gated Recurrent Units (GRU). Then, we apply soft-attention over the candidate NFoVs to trigger a sentence decoder that minimizes the reconstruction loss between the generated and given sentence. Finally, we obtain the NFoV as the candidate NFoV with the maximum attention, without any human supervision. To train the VGM more robustly, we also generate a reverse sentence conditioned on one minus the soft-attention, so that the attention focuses on candidate NFoVs less relevant to the given sentence. The negative log reconstruction loss of the reverse sentence (referred to as the "irrelevant loss") is jointly minimized to encourage the reverse sentence to be different from the given sentence. To evaluate our method, we collect the first narrated 360° video dataset and achieve state-of-the-art NFoV-grounding performance.
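
As a rough illustration of the attention and loss structure described in the abstract, the sketch below implements a simplified VGM-style module in PyTorch. The module names, layer sizes, and the assumption of pre-extracted CNN features for K candidate NFoVs are illustrative choices rather than the authors' implementation, and the sign of the irrelevant loss follows one reading of the abstract.

```python
# Minimal sketch of the soft-attention + dual reconstruction-loss structure
# described in the abstract. Shapes, layer sizes, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VGMSketch(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hid_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.sent_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)  # subtitle encoder (GRU)
        self.proj_v = nn.Linear(feat_dim, hid_dim)                  # map NFoV features into the shared space
        self.att_score = nn.Linear(hid_dim, 1)                      # scores each candidate NFoV
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)   # sentence decoder
        self.out = nn.Linear(hid_dim, vocab_size)

    def attend(self, nfov_feats, sent_hidden):
        # nfov_feats: (B, K, feat_dim) CNN features of K candidate NFoVs
        # sent_hidden: (B, hid_dim) final GRU state of the subtitle
        v = torch.tanh(self.proj_v(nfov_feats) + sent_hidden.unsqueeze(1))
        return F.softmax(self.att_score(v).squeeze(-1), dim=1)      # soft-attention over candidates, (B, K)

    def reconstruct_loss(self, context, tokens):
        # Decode the given subtitle from an attended visual context and return
        # the cross-entropy (negative log-likelihood) reconstruction loss.
        emb = self.embed(tokens[:, :-1])                            # teacher forcing on the given sentence
        dec_out, _ = self.decoder(emb, context.unsqueeze(0))        # context as initial decoder state
        logits = self.out(dec_out)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))

    def forward(self, nfov_feats, tokens):
        _, h = self.sent_enc(self.embed(tokens))                    # encode the subtitle
        sent_hidden = h.squeeze(0)
        alpha = self.attend(nfov_feats, sent_hidden)                # (B, K)

        v_proj = self.proj_v(nfov_feats)
        # Relevant context: attention-weighted sum of candidate NFoV features.
        ctx = torch.bmm(alpha.unsqueeze(1), v_proj).squeeze(1)
        # Irrelevant context: weight candidates by (1 - alpha), renormalized.
        beta = (1.0 - alpha) / (1.0 - alpha).sum(dim=1, keepdim=True)
        ctx_rev = torch.bmm(beta.unsqueeze(1), v_proj).squeeze(1)

        rec_loss = self.reconstruct_loss(ctx, tokens)               # reconstruction loss (given sentence)
        # "Irrelevant loss": negative of the reverse-context reconstruction loss,
        # so minimizing it discourages the less-relevant NFoVs from reconstructing
        # the given sentence (one reading of the abstract's description).
        irr_loss = -self.reconstruct_loss(ctx_rev, tokens)
        pred_nfov = alpha.argmax(dim=1)                             # NFoV = candidate with maximum attention
        return rec_loss + irr_loss, pred_nfov
```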

Locations

  • Proceedings of the AAAI Conference on Artificial Intelligence
  • arXiv (Cornell University)

Similar Works

  • Self-view Grounding Given a Narrated 360° Video (2017). Shih-Han Chou, Yi-Chun Chen, Kuo-Hao Zeng, Hou-Ning Hu, Jianlong Fu, Min Sun
  • Learning Temporal Sentence Grounding From Narrated EgoVideos (2023). Kevin C. Flanagan, Dima Damen, Michael Wray
  • VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation (2024). Chi Zhang, Liang Yu, Xintao Qiu, Fei Yi, Shichao Zhang
  • Movie101v2: Improved Movie Narration Benchmark (2024). Zihao Yue, Zhang Yepeng, Ziheng Wang, Qin Jin
  • GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 (2023). Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, W.K. Chan, Chong‐Wah Ngo, Nan Duan, Mike Zheng Shou
  • What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions (2023). Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih‐Fu Chang, Rogério Feris, James Glass, Hilde Kuehne
  • VideoAuteur: Towards Long Narrative Video Generation (2025). Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Z Ma, Alan Yuille, Lu Jiang
  • Towards Visual Grounding: A Survey (2024). Linhui Xiao, Xiaoshan Yang, Xingyu Lan, Yaowei Wang, Changsheng Xu
  • Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos (2021). Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell
  • What You Say Is What You Show: Visual Narration Detection in Instructional Videos (2023). Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
  • Visual Storytelling via Predicting Anchor Word Embeddings in the Stories (2020). Bowen Zhang, Hexiang Hu, Fei Sha
  • Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline (2024). Dingyi Yang, Chunru Zhan, Ziheng Wang, Biao Wang, Tiezheng Ge, Bo Zheng, Qin Jin
  • STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding (2022). Zihang Lin, Chaolei Tan, Jianfang Hu, Zhi Jin, Tiancai Ye, Wei‐Shi Zheng
  • EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation (2024). Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, Xingang Wang
  • Human-centric Spatio-Temporal Video Grounding With Visual Transformers (2020). Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, Dong Xu
  • Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network (2023). Haowei Wang, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Xiaoshuai Sun
  • Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning (2024). Yingjin Song, Denis Paperno, Albert Gatt
  • Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network (2023). Yiming Lin, Xiao-Bo Jin, Qiufeng Wang, Kaizhu Huang
  • Contextualize, Show and Tell: A Neural Visual Storyteller (2018). Diana Gonzalez-Rico, Gibrán Fuentes-Pineda