Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at this https URL.
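To make the modular split concrete, here is a minimal PyTorch sketch of the three pieces the abstract describes: a keypoint-trajectory tokenizer, a forward dynamics model trainable on action-free video, and an inverse dynamics model trainable on action-labeled data. All module names, dimensions, and the nearest-codebook quantization are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an AMPLIFY-style modular dynamics learner (not the authors' code).
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Quantizes short keypoint trajectories into discrete motion tokens (assumed VQ-style)."""
    def __init__(self, n_keypoints=32, horizon=8, codebook_size=512, dim=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(n_keypoints * horizon * 2, dim))
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, tracks):                                 # tracks: [B, K, T, 2]
        z = self.encode(tracks)                                # [B, dim]
        dists = (z.unsqueeze(1) - self.codebook.weight.unsqueeze(0)).pow(2).sum(-1)
        return dists.argmin(dim=-1)                            # discrete motion token ids: [B]

class ForwardDynamics(nn.Module):
    """Predicts the next motion token from observation features; trainable on action-free video."""
    def __init__(self, codebook_size=512, feat_dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, codebook_size))

    def forward(self, obs_feat):                               # [B, feat_dim]
        return self.head(obs_feat)                             # logits over motion tokens

class InverseDynamics(nn.Module):
    """Maps a motion token (plus current features) to a robot action; needs action labels."""
    def __init__(self, codebook_size=512, feat_dim=256, action_dim=7):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, 64)
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 64, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, obs_feat, motion_token):
        return self.mlp(torch.cat([obs_feat, self.tok_emb(motion_token)], dim=-1))

# The two losses can be optimized on disjoint datasets, which is the point of the split:
B = 4
tracks, obs_feat, actions = torch.randn(B, 32, 8, 2), torch.randn(B, 256), torch.randn(B, 7)
tokenizer, fwd, inv = MotionTokenizer(), ForwardDynamics(), InverseDynamics()
tokens = tokenizer(tracks)                                          # from (action-free) video
loss_fwd = nn.functional.cross_entropy(fwd(obs_feat), tokens)       # action-free objective
loss_inv = nn.functional.mse_loss(inv(obs_feat, tokens), actions)   # action-labeled objective
```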
https://arxiv.org/abs/2506.14198
Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.
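A rough sketch of the training objective as described: a video encoder regresses frozen DINO patch features for the present and a future frame. The tiny backbone, feature dimensions, and plain MSE loss below are placeholders, not FRAME's actual architecture.

```python
# Illustrative sketch of a FRAME-style objective (assumed details, not the real model).
import torch
import torch.nn as nn

class TinyVideoEncoder(nn.Module):
    """Stand-in for the video frame encoder; consumes past and present frames."""
    def __init__(self, dim=384):
        super().__init__()
        self.backbone = nn.Conv3d(3, dim, kernel_size=(2, 16, 16), stride=(1, 16, 16))
        self.head_now = nn.Linear(dim, dim)     # predicts DINO features of the current frame
        self.head_future = nn.Linear(dim, dim)  # predicts DINO features of a future frame

    def forward(self, clip):                    # clip: [B, 3, T, 224, 224]
        feat = self.backbone(clip)[:, :, -1]    # last temporal slice: [B, dim, 14, 14]
        tokens = feat.flatten(2).transpose(1, 2)  # [B, 196, dim] patch tokens
        return self.head_now(tokens), self.head_future(tokens)

def frame_style_loss(encoder, clip, dino_now, dino_future):
    """dino_*: frozen teacher patch features, [B, 196, 384]."""
    pred_now, pred_future = encoder(clip)
    return nn.functional.mse_loss(pred_now, dino_now) + \
           nn.functional.mse_loss(pred_future, dino_future)

clip = torch.randn(2, 3, 4, 224, 224)
teacher_now, teacher_future = torch.randn(2, 196, 384), torch.randn(2, 196, 384)
loss = frame_style_loss(TinyVideoEncoder(), clip, teacher_now, teacher_future)
```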
https://arxiv.org/abs/2506.05543
Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data for a real dual-arm robot, and results from both simulation and real-world experiments demonstrate the effectiveness of our method.
https://arxiv.org/abs/2505.24156
Existing long-term video prediction methods often rely on an autoregressive video prediction mechanism. However, this approach suffers from error propagation, particularly in distant future frames. To address this limitation, this paper proposes the first AutoRegression-Free (ARFree) video prediction framework using diffusion models. Unlike an autoregressive video prediction mechanism, ARFree directly predicts arbitrary future frame tuples from the context frame tuple. The proposed ARFree consists of two key components: 1) a motion prediction module that predicts future motion using motion features extracted from the context frame tuple; 2) a training method that improves motion continuity and contextual consistency between adjacent future frame tuples. Our experiments on two benchmark datasets show that the proposed ARFree video prediction framework outperforms several state-of-the-art video prediction methods.
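The autoregression-free idea can be illustrated with a toy denoiser that conditions every future tuple on the same context tuple plus an embedding of that tuple's temporal offset, so distant predictions never consume previously generated frames. Shapes, the offset embedding, and the convolutional body are assumptions for illustration.

```python
# Hedged sketch of an autoregression-free tuple denoiser (not the ARFree architecture).
import torch
import torch.nn as nn

class ARFreeDenoiser(nn.Module):
    def __init__(self, frames_per_tuple=4, ch=3, hidden=64, max_offset=16):
        super().__init__()
        in_ch = ch * frames_per_tuple * 2              # noisy future tuple + clean context tuple
        self.offset_emb = nn.Embedding(max_offset, hidden)
        self.enc = nn.Conv2d(in_ch, hidden, 3, padding=1)
        self.dec = nn.Conv2d(hidden, ch * frames_per_tuple, 3, padding=1)

    def forward(self, noisy_future, context, offset):  # tuples flattened to [B, T*3, H, W]
        h = torch.relu(self.enc(torch.cat([noisy_future, context], dim=1)))
        h = h + self.offset_emb(offset)[:, :, None, None]   # tells the model *which* tuple to predict
        return self.dec(h)                                   # predicted noise / clean tuple

# Any future tuple (offset k) is denoised from the same context, so errors cannot propagate:
B, T, H, W = 2, 4, 64, 64
model = ARFreeDenoiser()
context = torch.randn(B, T * 3, H, W)
for k in (0, 3, 7):                                          # near and distant tuples alike
    noisy = torch.randn(B, T * 3, H, W)
    out = model(noisy, context, torch.full((B,), k, dtype=torch.long))
```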
https://arxiv.org/abs/2505.22111
Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.
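The decoupling can be sketched as two separate streams: a deterministic predictor digests the conditioning frames on its own (and could be pretrained and frozen), while the denoiser sees only the noisy target plus the predictor's features. Everything below is a simplified stand-in for ForeDiff's actual networks.

```python
# Minimal sketch (assumed architecture) of decoupling condition understanding from denoising.
import torch
import torch.nn as nn

class DeterministicPredictor(nn.Module):
    """Separate predictive stream; could be pretrained on next-frame regression and frozen."""
    def __init__(self, ch=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, hidden, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(hidden, hidden, 3, padding=1))
    def forward(self, cond_frame):                   # [B, 3, H, W]
        return self.net(cond_frame)                  # guidance features [B, hidden, H, W]

class GuidedDenoiser(nn.Module):
    """Denoising stream that only sees the noisy target plus the predictor's features."""
    def __init__(self, ch=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch + hidden, hidden, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(hidden, ch, 3, padding=1))
    def forward(self, noisy_target, guidance):
        return self.net(torch.cat([noisy_target, guidance], dim=1))

predictor, denoiser = DeterministicPredictor(), GuidedDenoiser()
cond, noisy = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
with torch.no_grad():                                # e.g. keep the pretrained predictor fixed
    guidance = predictor(cond)
eps_hat = denoiser(noisy, guidance)                  # noise prediction guided by the condition stream
```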
https://arxiv.org/abs/2505.16474
The task of estimating a world model describing the dynamics of a real-world process is immensely important for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a neuro-symbolic, human-interpretable set of states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.
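To illustrate the kind of program the LLM/VLM is asked to synthesize, here is a hand-written Cart Pole example with an interpretable per-frame state, a transition function, and a renderer back to RGB frames. It is our own illustrative program (standard unforced cart-pole equations and a crude renderer), not output generated by ProgGen.

```python
# Illustrative "synthesized program" for Cart Pole: state estimation -> transition -> render.
import math
from dataclasses import dataclass
import numpy as np

@dataclass
class CartPoleState:
    x: float          # cart position
    x_dot: float      # cart velocity
    theta: float      # pole angle (rad)
    theta_dot: float  # pole angular velocity

def transition(s: CartPoleState, dt: float = 0.02, g: float = 9.8,
               m_cart: float = 1.0, m_pole: float = 0.1, l: float = 0.5) -> CartPoleState:
    """Transition dynamics: standard unforced cart-pole equations of motion."""
    sin_t, cos_t = math.sin(s.theta), math.cos(s.theta)
    total_m = m_cart + m_pole
    temp = m_pole * l * s.theta_dot ** 2 * sin_t / total_m
    theta_acc = (g * sin_t - cos_t * temp) / (l * (4.0 / 3.0 - m_pole * cos_t ** 2 / total_m))
    x_acc = temp - m_pole * l * theta_acc * cos_t / total_m
    return CartPoleState(s.x + dt * s.x_dot, s.x_dot + dt * x_acc,
                         s.theta + dt * s.theta_dot, s.theta_dot + dt * theta_acc)

def render(s: CartPoleState, h: int = 64, w: int = 64) -> np.ndarray:
    """Renderer: draw the cart and the pole tip into a small RGB frame."""
    frame = np.zeros((h, w, 3), dtype=np.uint8)
    cx = int(np.clip(w / 2 + s.x * 10, 1, w - 2))
    frame[h - 6:h - 2, max(cx - 4, 0):cx + 4] = (255, 255, 255)           # cart body
    tip_x = int(np.clip(cx + 20 * math.sin(s.theta), 0, w - 1))
    tip_y = int(np.clip(h - 6 - 20 * math.cos(s.theta), 0, h - 1))
    frame[tip_y, tip_x] = (255, 0, 0)                                      # pole tip
    return frame

state = CartPoleState(0.0, 0.0, 0.1, 0.0)   # step (i): state estimated from the context frames
future_frames = []
for _ in range(10):                          # steps (ii) predict and (iii) render
    state = transition(state)
    future_frames.append(render(state))
```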
https://arxiv.org/abs/2505.14948
GPT has shown remarkable success in natural language processing. However, a language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, a video sequence is good at capturing such details. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Different from previous works, this distinct paradigm allows Video-GPT to tackle both short-term generation and long-term prediction, by autoregressively denoising the noisy clip according to the clean clips in the history. Extensive experiments show our Video-GPT achieves state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks in both video generation and understanding, showing its strong generalization capacity downstream. The project page is at this https URL.
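A hedged sketch of one "next clip diffusion" training step: noise the next clip, condition a denoiser on the clean history clips and the diffusion timestep, and regress the noise. The tiny denoiser, mean-pooled history conditioning, and cosine schedule are assumptions, not Video-GPT's design.

```python
# Toy next-clip-diffusion training step (assumed details, clips flattened to [B, T*3, H, W]).
import torch
import torch.nn as nn

class NextClipDenoiser(nn.Module):
    def __init__(self, clip_frames=4, ch=3, hidden=64, n_steps=1000):
        super().__init__()
        in_ch = ch * clip_frames * 2                       # noisy next clip + pooled history
        self.t_emb = nn.Embedding(n_steps, hidden)
        self.enc = nn.Conv2d(in_ch, hidden, 3, padding=1)
        self.dec = nn.Conv2d(hidden, ch * clip_frames, 3, padding=1)

    def forward(self, noisy_clip, history, t):
        h = torch.relu(self.enc(torch.cat([noisy_clip, history.mean(dim=1)], dim=1)))
        return self.dec(h + self.t_emb(t)[:, :, None, None])

B, n_hist, T, H, W, n_steps = 2, 3, 4, 64, 64, 1000
model = NextClipDenoiser(clip_frames=T, n_steps=n_steps)
history = torch.randn(B, n_hist, T * 3, H, W)              # clean clips seen so far
x0 = torch.randn(B, T * 3, H, W)                           # the next clip (training target)
t = torch.randint(0, n_steps, (B,))
alpha_bar = torch.cos(t.float() / n_steps * torch.pi / 2) ** 2      # assumed cosine schedule
noise = torch.randn_like(x0)
noisy = alpha_bar.sqrt()[:, None, None, None] * x0 \
        + (1 - alpha_bar).sqrt()[:, None, None, None] * noise
loss = nn.functional.mse_loss(model(noisy, history, t), noise)      # epsilon prediction
```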
https://arxiv.org/abs/2505.12489
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as an explicit motion representation. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and a diffusion model then predicts the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer outperforms baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate across various robot manipulation domains.
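A rough, assumption-heavy sketch of the two-part design: a small network standing in for the flow U-Net predicts per-pixel 3D scene flow from an RGB-D frame and an action, and a denoiser consumes that flow when predicting the future frame; both losses share one backward pass to mimic end-to-end training. The ground-truth flow supervision shown here is an assumption.

```python
# Toy FlowDreamer-style pipeline: explicit scene flow, then flow-conditioned denoising.
import torch
import torch.nn as nn

class SceneFlowNet(nn.Module):
    """Tiny stand-in for the flow U-Net: RGB-D frame + action -> per-pixel 3D scene flow."""
    def __init__(self, action_dim=7, hidden=32):
        super().__init__()
        self.act_proj = nn.Linear(action_dim, hidden)
        self.net = nn.Sequential(nn.Conv2d(4 + hidden, hidden, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(hidden, 3, 3, padding=1))
    def forward(self, rgbd, action):                        # [B, 4, H, W], [B, action_dim]
        a = self.act_proj(action)[:, :, None, None].expand(-1, -1, *rgbd.shape[-2:])
        return self.net(torch.cat([rgbd, a], dim=1))        # [B, 3, H, W] scene flow

class FlowConditionedDenoiser(nn.Module):
    """Predicts the future frame's noise given the noisy target, past frame, and flow."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(4 + 4 + 3, hidden, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(hidden, 4, 3, padding=1))
    def forward(self, noisy_future, rgbd_past, flow):
        return self.net(torch.cat([noisy_future, rgbd_past, flow], dim=1))

flow_net, denoiser = SceneFlowNet(), FlowConditionedDenoiser()
rgbd, action = torch.randn(2, 4, 64, 64), torch.randn(2, 7)
noisy_future, noise = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
flow_gt = torch.randn(2, 3, 64, 64)                         # assumed flow supervision
flow = flow_net(rgbd, action)
loss = nn.functional.mse_loss(flow, flow_gt) + \
       nn.functional.mse_loss(denoiser(noisy_future, rgbd, flow), noise)   # one backward pass
loss.backward()
```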
https://arxiv.org/abs/2505.10075
Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
https://arxiv.org/abs/2504.11732
Video prediction (VP) generates future frames by leveraging spatial representations and temporal context from past frames. Traditional recurrent neural network (RNN)-based models enhance memory cell structures to capture spatiotemporal states over extended durations but suffer from gradual loss of object appearance details. To address this issue, we propose the strong recollection VP (SRVP) model, which integrates standard attention (SA) and reinforced feature attention (RFA) modules. Both modules employ scaled dot-product attention to extract temporal context and spatial correlations, which are then fused to enhance spatiotemporal representations. Experiments on three benchmark datasets demonstrate that SRVP mitigates image quality degradation in RNN-based models while achieving predictive performance comparable to RNN-free architectures.
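An illustrative guess at how two scaled dot-product attention branches could be fused — one attending over time per spatial token and one over space per frame — since the exact SA/RFA designs are not spelled out in the abstract.

```python
# Sketch of fusing a temporal and a spatial attention branch (assumed design, not SRVP's).
import torch
import torch.nn as nn

class SRVPStyleBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # "SA"-like
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)   # "RFA"-like
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                       # x: [B, T, N, C] tokens per frame
        B, T, N, C = x.shape
        # temporal context: attend across time for each spatial token
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        t_out, _ = self.temporal_attn(xt, xt, xt)
        t_out = t_out.reshape(B, N, T, C).permute(0, 2, 1, 3)
        # spatial correlations: attend across tokens within each frame
        xs = x.reshape(B * T, N, C)
        s_out, _ = self.spatial_attn(xs, xs, xs)
        s_out = s_out.reshape(B, T, N, C)
        return self.fuse(torch.cat([t_out, s_out], dim=-1))   # fused spatiotemporal features

x = torch.randn(2, 8, 49, 64)                    # 8 frames, 7x7 feature grid, 64 channels
y = SRVPStyleBlock()(x)                          # same shape, enriched with both contexts
```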
https://arxiv.org/abs/2504.08012
Model Predictive Control (MPC) is a widely adopted control paradigm that leverages predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels in planning and control, it lacks the capability for environmental perception, leading to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that integrates the perception power of vision-language models (VLMs) with MPC. VLMPC utilizes a conditional action sampling module that takes a goal image or language instruction as input and leverages VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames based on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics conditioned on the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the optimal action sequence using a VLM-based hierarchical cost function that captures both pixel-level and knowledge-level consistency between the current observation and the task input. We demonstrate that both approaches outperform existing state-of-the-art methods on public benchmarks and achieve excellent performance in various real-world robotic manipulation tasks. Code is available at this https URL.
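The planning loop itself is standard MPC and can be sketched with placeholders for the three learned components (the VLM-conditioned action sampler, the video or trajectory predictor, and the VLM-based hierarchical cost); the stubs below only show the data flow, not the real models.

```python
# Schematic VLMPC-style planning loop; every component is a placeholder stub.
import torch

def sample_candidate_actions(goal, n_candidates=16, horizon=8, action_dim=7):
    """Stand-in for the VLM-conditioned action sampler."""
    return torch.randn(n_candidates, horizon, action_dim)

def predict_futures(observation, candidates):
    """Stand-in for the video (VLMPC) or trajectory (Traj-VLMPC) prediction model."""
    n, horizon, _ = candidates.shape
    return observation.expand(n, horizon, *observation.shape[-3:])   # fake future frames

def hierarchical_cost(futures, goal_image):
    """Stand-in for the VLM-based cost: pixel-level term + knowledge-level term."""
    pixel_cost = (futures[:, -1] - goal_image).abs().mean(dim=(1, 2, 3))
    knowledge_cost = torch.zeros(futures.shape[0])                   # e.g. a VLM preference score
    return pixel_cost + knowledge_cost

observation = torch.randn(3, 64, 64)           # current camera frame
goal_image = torch.randn(3, 64, 64)
candidates = sample_candidate_actions(goal_image)
futures = predict_futures(observation, candidates)
best = candidates[hierarchical_cost(futures, goal_image).argmin()]
action_to_execute = best[0]                    # receding horizon: execute only the first action
```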
https://arxiv.org/abs/2504.05225
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer -- we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. Across all three scenarios (DoRA, VideoMAE, future prediction), our orthogonal optimizer outperforms the strong AdamW baseline.
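A small sketch of the orthogonal-gradient idea as a wrapper around any base optimizer: before each step, the component of the current gradient that lies along the previous batch's gradient is projected out, decorrelating consecutive (highly redundant) streaming batches. The exact projection rule used in the paper may differ.

```python
# Sketch of an orthogonal-gradient wrapper usable with SGD or AdamW (assumed projection rule).
import torch

class OrthogonalGradients:
    def __init__(self, params, base_optimizer, eps=1e-12):
        self.params = [p for p in params if p.requires_grad]
        self.opt = base_optimizer
        self.prev = [None] * len(self.params)   # previous batch's gradient per parameter
        self.eps = eps

    @torch.no_grad()
    def step(self):
        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            g, g_prev = p.grad, self.prev[i]
            if g_prev is not None:
                coeff = (g * g_prev).sum() / (g_prev.norm() ** 2 + self.eps)
                g.sub_(coeff * g_prev)          # remove the component along the previous direction
            self.prev[i] = g.clone()
        self.opt.step()

    def zero_grad(self):
        self.opt.zero_grad()

# Usage with AdamW on a toy model trained on a (correlated) stream of batches:
model = torch.nn.Linear(10, 1)
opt = OrthogonalGradients(model.parameters(), torch.optim.AdamW(model.parameters(), lr=1e-3))
for _ in range(5):
    x, y = torch.randn(4, 10), torch.randn(4, 1)
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()
```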
https://arxiv.org/abs/2504.01961
Transmission latency significantly affects users' quality of experience in real-time interaction and actuation. As latency is fundamentally unavoidable, video prediction can be utilized to mitigate it and ultimately enable zero-latency transmission. However, most existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose a real-time video prediction method toward zero-latency interaction over networks, called IFRVP (Intermediate Feature Refinement Video Prediction). Firstly, we propose three training methods for video prediction that extend frame interpolation models, where we utilize a simple convolution-only frame interpolation network based on IFRNet. Secondly, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed among existing video prediction methods. A demonstration movie is also provided at this http URL.
https://arxiv.org/abs/2503.23185
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.
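A condensed sketch of attention along point tracks under assumed interfaces: frame features are sampled at each track's locations, attention runs along every track over time, and the update is splatted back to the nearest pixels. The shapes, nearest-pixel splat, and residual update are illustrative choices, not the Tracktention Layer's exact design.

```python
# Toy track-based temporal attention: sample along tracks, attend over time, splat back.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TracktentionSketch(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, tracks):
        # feats: [B, T, C, H, W]; tracks: [B, N, T, 2] with (x, y) in [-1, 1] grid coords
        B, T, C, H, W = feats.shape
        N = tracks.shape[1]
        grid = tracks.permute(0, 2, 1, 3).reshape(B * T, N, 1, 2)            # [B*T, N, 1, 2]
        sampled = F.grid_sample(feats.reshape(B * T, C, H, W), grid,
                                align_corners=False)                          # [B*T, C, N, 1]
        tokens = sampled.squeeze(-1).permute(0, 2, 1).reshape(B, T, N, C)
        per_track = tokens.permute(0, 2, 1, 3).reshape(B * N, T, C)           # attend along time
        updated, _ = self.attn(per_track, per_track, per_track)
        delta = (updated - per_track).reshape(B, N, T, C).permute(0, 2, 3, 1)  # [B, T, C, N]
        # splat the per-track update back to the nearest pixel of each track point
        out = feats.clone().reshape(B * T, C, H, W)
        xs = ((tracks[..., 0] + 1) / 2 * (W - 1)).round().long().clamp(0, W - 1)  # [B, N, T]
        ys = ((tracks[..., 1] + 1) / 2 * (H - 1)).round().long().clamp(0, H - 1)
        for b in range(B):
            for t in range(T):
                out[b * T + t, :, ys[b, :, t], xs[b, :, t]] += delta[b, t]
        return out.reshape(B, T, C, H, W)

feats, tracks = torch.randn(1, 4, 64, 32, 32), torch.rand(1, 16, 4, 2) * 2 - 1
out = TracktentionSketch()(feats, tracks)        # temporally aligned features, same shape
```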
https://arxiv.org/abs/2503.19904
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
https://arxiv.org/abs/2503.18945
Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.
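The information-sharing module can be sketched as symmetric cross-attention between the two modalities' spatio-temporal tokens; the token shapes and residual connections below are assumptions, not SyncVP's exact design.

```python
# Minimal sketch of cross-modal spatio-temporal attention (assumed interface).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, depth_tokens):      # each: [B, T*N, C] spatio-temporal tokens
        rgb_out, _ = self.rgb_from_depth(rgb_tokens, depth_tokens, depth_tokens)
        depth_out, _ = self.depth_from_rgb(depth_tokens, rgb_tokens, rgb_tokens)
        return rgb_tokens + rgb_out, depth_tokens + depth_out   # residual information sharing

rgb = torch.randn(2, 8 * 49, 64)       # 8 frames x 7x7 latent grid for each modality
depth = torch.randn(2, 8 * 49, 64)
rgb, depth = CrossModalAttention()(rgb, depth)
```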
https://arxiv.org/abs/2503.18933
Ensuring the safety and well-being of elderly and vulnerable populations in assisted living environments is a critical concern. Computer vision presents an innovative and powerful approach to predicting health risks through video monitoring, employing human action recognition (HAR) technology. However, real-time prediction of human actions with high performance and efficiency is a challenge. This research proposes a real-time human action recognition model that combines a deep learning model with a live video prediction and alert system in order to predict falls, staggering, and chest pain for residents in assisted living. Six thousand RGB video samples from the NTU RGB+D 60 dataset were selected to create a dataset with four classes: Falling, Staggering, Chest Pain, and Normal, with the Normal class comprising 40 daily activities. Transfer learning was applied to train four state-of-the-art HAR models on a GPU server, namely UniFormerV2, TimeSformer, I3D, and SlowFast. Results for the four models are presented based on class-wise and macro performance metrics, inference efficiency, model complexity, and computational cost. TimeSformer is proposed for developing the real-time human action recognition model, leveraging its leading macro F1 score (95.33%), recall (95.49%), and precision (95.19%), along with significantly higher inference throughput than the others. This research provides insights to enhance the safety and health of the elderly and people with chronic illnesses in assisted living environments, fostering sustainable care, smarter communities, and industry innovation.
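A hedged sketch of the transfer-learning recipe, using torchvision's Kinetics-pretrained r3d_18 as a generic stand-in for the video backbones the paper fine-tunes (UniFormerV2, TimeSformer, I3D, SlowFast): replace the classification head with a four-class head and fine-tune only the later layers.

```python
# Transfer-learning sketch with a stand-in pretrained video backbone (not the paper's models).
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_CLASSES = 4                                   # Falling, Staggering, Chest Pain, Normal
model = r3d_18(weights="DEFAULT")                 # downloads Kinetics-400 pretrained weights
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # replace the classification head

# Optionally freeze early layers and fine-tune only the last block and the new head:
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

clips = torch.randn(2, 3, 16, 112, 112)           # [B, C, T, H, W] RGB clips
labels = torch.tensor([0, 3])
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```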
https://arxiv.org/abs/2503.18957
Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice, TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within FCA, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at this https URL.
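One way to realize frame-wise conditioning, under assumed interfaces: learned per-frame queries cross-attend to the instruction tokens to produce one text embedding per future frame, and each frame's latent tokens then attend to their own frame-wise embedding as an extra condition. This is a sketch of the idea, not the paper's FCA module.

```python
# Sketch of frame-wise text conditioning injected into per-frame video latents (assumed design).
import torch
import torch.nn as nn

class FrameWiseTextCondition(nn.Module):
    def __init__(self, text_dim=512, n_frames=16, dim=64):
        super().__init__()
        self.frame_queries = nn.Parameter(torch.randn(n_frames, text_dim))
        self.expand = nn.MultiheadAttention(text_dim, 8, batch_first=True)   # text -> per-frame text
        self.proj = nn.Linear(text_dim, dim)
        self.inject = nn.MultiheadAttention(dim, 4, batch_first=True)        # condition each frame

    def forward(self, text_tokens, frame_tokens):
        # text_tokens: [B, L, text_dim]; frame_tokens: [B, n_frames, N, dim] video latents
        B, n_frames, N, dim = frame_tokens.shape
        q = self.frame_queries.unsqueeze(0).expand(B, -1, -1)
        framewise_text, _ = self.expand(q, text_tokens, text_tokens)         # [B, n_frames, text_dim]
        cond = self.proj(framewise_text).reshape(B * n_frames, 1, dim)
        tokens = frame_tokens.reshape(B * n_frames, N, dim)
        out, _ = self.inject(tokens, cond, cond)                             # per-frame condition
        return (tokens + out).reshape(B, n_frames, N, dim)

text = torch.randn(2, 20, 512)                    # encoded instruction tokens
latents = torch.randn(2, 16, 49, 64)              # 16 frames of 7x7 latent tokens
conditioned = FrameWiseTextCondition()(text, latents)
```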
https://arxiv.org/abs/2503.12953
In recent years, weather forecasting has gained significant attention. However, accurately predicting weather remains a challenge due to the rapid variability of meteorological data and potential teleconnections. Current spatiotemporal forecasting models primarily rely on convolution operations or sliding windows for feature extraction. These methods are limited by the size of the convolutional kernel or sliding window, making it difficult to capture and identify potential teleconnection features in meteorological data. Additionally, weather data often involve non-rigid bodies, whose motion processes are accompanied by unpredictable deformations, further complicating the forecasting task. In this paper, we propose the GMG model to address these two core challenges. The Global Focus Module, a key component of our model, enhances the global receptive field, while the Motion Guided Module adapts to the growth or dissipation processes of non-rigid bodies. Through extensive evaluations, our method demonstrates competitive performance across various complex tasks, providing a novel approach to improving the predictive accuracy of complex spatiotemporal data.
https://arxiv.org/abs/2503.11297
Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, i.e., RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentanglement representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.
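A speculative sketch of the two named ingredients: latent distillation from a pretrained action-free encoder into the world model's encoder, plus a simple disentanglement-style regularizer (an off-diagonal covariance penalty here; the paper's actual constraint may differ).

```python
# Toy latent distillation + disentanglement-style regularization (assumed losses).
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.Flatten(), nn.Linear(32 * 32 * 32, latent_dim))
    def forward(self, frame):                       # [B, 3, 64, 64] -> [B, latent_dim]
        return self.net(frame)

def off_diagonal_covariance_penalty(z):
    """Encourages factorized latents by penalizing cross-dimension covariance."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    return (cov - torch.diag(torch.diagonal(cov))).pow(2).sum()

teacher = LatentEncoder()                           # pretrained offline on distracting videos
student = LatentEncoder()                           # the online world model's encoder
teacher.requires_grad_(False)

frames = torch.randn(8, 3, 64, 64)
z_student = student(frames)
with torch.no_grad():
    z_teacher = teacher(frames)
loss = nn.functional.mse_loss(z_student, z_teacher) \
       + 0.1 * off_diagonal_covariance_penalty(z_student)     # distillation + disentanglement
loss.backward()
```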
https://arxiv.org/abs/2503.08751