Paper Reading AI Learner

UniDual: A Unified Model for Image and Video Understanding

2019-06-10 09:15:52
Yufei Wang, Du Tran, Lorenzo Torresani

Abstract

Although a video is effectively a sequence of images, visual perception systems typically model images and videos separately, thus failing to exploit the correlation and the synergy provided by these two media. While a few prior research efforts have explored the benefits of leveraging still-image datasets for video analysis, or vice versa, most of these attempts have been limited to pretraining a model on one type of visual modality and then adapting it via finetuning on the other modality. In contrast, in this paper we introduce a framework that enables joint training of a unified model on mixed collections of image and video examples spanning different tasks. The key ingredient in our architecture design is a new network block, which we name UniDual. It consists of a shared 2D spatial convolution followed by two parallel point-wise convolutional layers, one devoted to images and the other one used for videos. For video input, the point-wise filtering implements a temporal convolution. For image input, it performs a pixel-wise nonlinear transformation. Repeated stacking of such blocks gives rise to a network where images and videos undergo partially distinct execution pathways, unified by spatial convolutions (capturing commonalities in visual appearance) but separated by point-wise operations (modeling patterns specific to each modality). Extensive experiments on Kinetics and ImageNet demonstrate that our UniDual model jointly trained on these datasets yields substantial accuracy gains for both tasks, compared to (1) training separate models, (2) traditional multi-task learning, and (3) the conventional framework of pretraining followed by finetuning. On Kinetics, the UniDual architecture applied to a state-of-the-art video backbone model (R(2+1)D-152) yields an additional video@1 accuracy gain of 1.5%.
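The block described above (a shared 2D spatial convolution followed by two modality-specific point-wise layers) can be sketched roughly as follows. This is an illustrative reconstruction from the abstract alone, not the authors' implementation; the class name, channel counts, and normalization placement are assumptions.

```python
import torch
import torch.nn as nn


class UniDualBlock(nn.Module):
    """Hedged sketch of a UniDual-style block, inferred from the abstract:
    a shared spatial convolution feeds two parallel point-wise pathways,
    one for video (a temporal convolution) and one for images (a 1x1x1,
    i.e. per-pixel, transformation). Details here are illustrative."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Shared spatial filtering: a 1x3x3 kernel applies the same 2D
        # convolution frame-by-frame, so both modalities can use it
        # (images are treated as single-frame clips with T == 1).
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        # Video pathway: point-wise in space, 3-tap convolution in time.
        self.video_pw = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        # Image pathway: 1x1x1 convolution, a per-pixel linear map.
        self.image_pw = nn.Conv3d(out_ch, out_ch, kernel_size=(1, 1, 1),
                                  bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # x has shape (N, C, T, H, W); image batches use T == 1.
        y = self.spatial(x)
        y = self.video_pw(y) if modality == "video" else self.image_pw(y)
        return self.relu(self.bn(y))
```

Stacking such blocks yields the partially shared execution pathways the abstract describes: the `spatial` weights are updated by both image and video examples, while each `*_pw` layer sees gradients from only one modality.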


URL

https://arxiv.org/abs/1906.03857

PDF

https://arxiv.org/pdf/1906.03857.pdf

