Abstract
Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.
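The latent dynamics idea named in the abstract can be illustrated with a toy sketch: an inverse-dynamics step infers a motion latent at every spatial location from the features of two frames, a forward step predicts the second frame's features from the first frame's features plus those latents, and the loss is averaged densely over all locations. This is not the Midway Network architecture. The single linear maps, the `tanh` nonlinearity, the squared-error loss, the feature and latent sizes, and all function names are assumptions chosen for illustration, and the hierarchical structure and midway top-down path are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(d_in, d_out):
    # Hypothetical stand-in for a learned module: one random linear map.
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))

def infer_motion_latents(feat_src, feat_tgt, w_inv):
    # Inverse dynamics (assumed form): infer one motion latent per spatial
    # location from the concatenated features of both frames.
    pair = np.concatenate([feat_src, feat_tgt], axis=-1)  # (H, W, 2*D)
    return np.tanh(pair @ w_inv)                          # (H, W, Z)

def predict_forward(feat_src, latents, w_fwd):
    # Forward dynamics (assumed form): predict target-frame features from
    # source-frame features conditioned on the inferred latents.
    cond = np.concatenate([feat_src, latents], axis=-1)   # (H, W, D+Z)
    return cond @ w_fwd                                   # (H, W, D)

def dense_prediction_loss(pred, feat_tgt):
    # Dense objective (assumed to be squared error): the error is averaged
    # over every spatial location and channel, not over a pooled vector.
    return float(np.mean((pred - feat_tgt) ** 2))

# Illustrative sizes only: an 8x8 feature grid, 16 channels, 4-dim latents.
H, W, D, Z = 8, 8, 16, 4
feat_src = rng.normal(size=(H, W, D))  # placeholder features for frame t
feat_tgt = rng.normal(size=(H, W, D))  # placeholder features for frame t+1
w_inv = init_linear(2 * D, Z)
w_fwd = init_linear(D + Z, D)

latents = infer_motion_latents(feat_src, feat_tgt, w_inv)
pred = predict_forward(feat_src, latents, w_fwd)
loss = dense_prediction_loss(pred, feat_tgt)
```

In a trained model the two maps would be learned jointly with the feature encoder by minimizing this loss. Here they are random, so the loss value itself carries no meaning and only the data flow is illustrated.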
URL
https://arxiv.org/abs/2510.05558