Paper Reading AI Learner

Learning from Streaming Video with Orthogonal Gradients

2025-04-02 17:59:57
Tengda Han, Dilara Gokay, Joseph Heyward, Chuhan Zhang, Daniel Zoran, Viorica Pătrăucean, João Carreira, Dima Damen, Andrew Zisserman

Abstract

We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from standard approaches to video learning, where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers that decorrelates batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer -- we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. Across all three scenarios (DoRA, VideoMAE, future prediction), our orthogonal optimizer outperforms the strong AdamW baseline.
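The abstract describes projecting out the correlated component of consecutive gradients so that sequential (non-IID) batches behave more like decorrelated ones. The paper's exact update rule is not reproduced here; the following is a minimal sketch of one plausible instantiation, in which the current gradient is made orthogonal to the previous step's gradient before an SGD update (all names, the learning rate, and the single-vector formulation are illustrative assumptions).

```python
import numpy as np

def orthogonal_sgd_step(params, grad, prev_grad, lr=0.01, eps=1e-12):
    """One SGD step using only the component of `grad` orthogonal
    to `prev_grad`.

    Hypothetical sketch: removing the direction shared with the
    previous (highly correlated) gradient is one way to decorrelate
    consecutive batches drawn from a continuous video stream.
    """
    # Projection of grad onto prev_grad (eps guards against a zero vector).
    denom = np.dot(prev_grad, prev_grad) + eps
    proj = (np.dot(grad, prev_grad) / denom) * prev_grad
    grad_orth = grad - proj  # orthogonal to prev_grad by construction
    return params - lr * grad_orth, grad_orth

# Toy example: two consecutive gradients sharing a redundant direction.
params = np.array([1.0, 2.0, 3.0])
prev_grad = np.array([1.0, 0.0, 0.0])
grad = np.array([0.5, 0.4, -0.2])
new_params, g_orth = orthogonal_sgd_step(params, grad, prev_grad)
print(np.dot(g_orth, prev_grad))  # ~0: the shared direction is removed
```

The same projection could in principle be applied to the update direction of any optimizer (e.g. AdamW's preconditioned step) rather than the raw gradient, which is what "can be applied to any optimizer" suggests.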


URL

https://arxiv.org/abs/2504.01961

PDF

https://arxiv.org/pdf/2504.01961.pdf

