Egocentric scene context for human-centric environment understanding from video

2022-07-22 22:39:57

Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

arXiv_CV

arXiv_CV Video_Caption Classification Pose 3D Scene_Classification Agent

Abstract

First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and only capture what is directly seen. We present an approach that links egocentric video and camera pose over time by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings to facilitate human-centric environment understanding. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on real-world videos of house tours from unseen environments. We show that by grounding videos in their physical environment, our models surpass traditional scene classification models at predicting which room a camera-wearer is in (where frame-level information is insufficient), and can leverage this grounding to localize video moments corresponding to environment-centric queries, outperforming prior methods. Project page: this http URL

URL

https://arxiv.org/abs/2207.11365

PDF

https://arxiv.org/pdf/2207.11365.pdf