OWL : Localizing Actions in Egocentric Video via Audiovisual Temporal Context

2022-02-10 10:50:52

Merey Ramazanova, Victor Escorcia, Fabian Caba Heilbron, Chen Zhao, Bernard Ghanem

arXiv_SD

Abstract
Abstract (translated)
URL
PDF

Abstract

Temporal action localization (TAL) is an important task extensively explored and improved for third-person videos in recent years. Recent efforts have been made to perform fine-grained temporal localization on first-person videos. However, current TAL methods only use visual signals, neglecting the audio modality that exists in most videos and that shows meaningful action information in egocentric videos. In this work, we take a deep look into the effectiveness of audio in detecting actions in egocentric videos and introduce a simple-yet-effective approach via Observing, Watching, and Listening (OWL) to leverage audio-visual information and context for egocentric TAL. For doing that, we: 1) compare and study different strategies for where and how to fuse the two modalities; 2) propose a transformer-based model to incorporate temporal audio-visual context. Our experiments show that our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.

Abstract (translated)

URL

https://arxiv.org/abs/2202.04947

PDF

https://arxiv.org/pdf/2202.04947.pdf