
Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics

2025-10-07 04:07:44
Christopher Hoang, Mengye Ren

Abstract

Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.
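The abstract describes inferring motion latents between frames and training with a dense forward prediction objective. As a rough illustration of that general idea only, the sketch below uses generic latent dynamics modeling on toy feature maps. Everything in it is an assumption made for illustration and not the paper's architecture: the random linear maps `W_inv` and `W_fwd`, the function names, the tensor sizes, and the mean-squared-error loss. In particular, it does not model the midway top-down path or the hierarchical structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: feature-map height, width, channels, and motion-latent dim.
H, W, D, Z = 4, 4, 8, 3

# Random linear maps standing in for learned networks (assumptions, not the paper's modules).
W_inv = rng.normal(scale=0.1, size=(2 * D, Z))  # (features_t, features_t1) -> motion latent
W_fwd = rng.normal(scale=0.1, size=(D + Z, D))  # (features_t, motion latent) -> predicted features_t1


def infer_motion_latent(f_t, f_t1):
    """Infer a per-location motion latent from the features of two frames."""
    return np.tanh(np.concatenate([f_t, f_t1], axis=-1) @ W_inv)


def predict_next(f_t, z):
    """Predict next-frame features from current features and the motion latent."""
    return np.concatenate([f_t, z], axis=-1) @ W_fwd


def dense_forward_loss(f_t, f_t1):
    """Mean squared error between predicted and actual next-frame features, over all locations."""
    z = infer_motion_latent(f_t, f_t1)
    pred = predict_next(f_t, z)
    return float(np.mean((pred - f_t1) ** 2))


# Toy feature maps for two consecutive frames (random stand-ins for encoder outputs).
f_t = rng.normal(size=(H, W, D))
f_t1 = rng.normal(size=(H, W, D))

z = infer_motion_latent(f_t, f_t1)
loss = dense_forward_loss(f_t, f_t1)
print(z.shape, loss)
```

The loss is averaged over every spatial location rather than computed on a single pooled vector, which is one plausible reading of "dense" here; the paper may define its objective differently.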


URL

https://arxiv.org/abs/2510.05558

PDF

https://arxiv.org/pdf/2510.05558.pdf

