Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available at this https URL.
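To make the latent-action idea concrete, here is a minimal PyTorch sketch of the two pieces the abstract describes: an inverse-dynamics-style module that infers a latent action from consecutive object-slot states, and a predictor that rolls the slots forward conditioned on that action. The module names, dimensions, and simple MLP design are illustrative assumptions, not PlaySlot's actual architecture.

```python
import torch
import torch.nn as nn

class LatentActionInference(nn.Module):
    """Infers a scene-level latent action from two consecutive sets of object slots."""
    def __init__(self, slot_dim=64, action_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * slot_dim, 128), nn.ReLU(), nn.Linear(128, action_dim)
        )

    def forward(self, slots_t, slots_t1):
        # slots_*: (batch, num_slots, slot_dim); pool over slots for a scene-level action
        pair = torch.cat([slots_t, slots_t1], dim=-1)
        return self.encoder(pair).mean(dim=1)          # (batch, action_dim)

class ActionConditionedPredictor(nn.Module):
    """Rolls object slots one step forward, conditioned on a latent action."""
    def __init__(self, slot_dim=64, action_dim=16):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(slot_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, slot_dim)
        )

    def forward(self, slots_t, action):
        act = action.unsqueeze(1).expand(-1, slots_t.size(1), -1)
        return slots_t + self.step(torch.cat([slots_t, act], dim=-1))   # residual update

# usage: infer an action from observed dynamics, then forecast the next slot state
slots_t, slots_t1 = torch.randn(2, 4, 7, 64).unbind(0)
infer, predict = LatentActionInference(), ActionConditionedPredictor()
z = infer(slots_t, slots_t1)
pred_slots = predict(slots_t, z)
print(pred_slots.shape)   # torch.Size([4, 7, 64])
```

At test time the same predictor could instead take a user-provided or policy-generated latent action, which is the multi-future behavior the abstract describes.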
https://arxiv.org/abs/2502.07600
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluation for scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining the computational efficiency needed to run in real time. HMA uses heterogeneous pre-training from observations and action sequences across different robotic embodiments, domains, and tasks, and uses masked autoregression to generate quantized or soft tokens for video prediction. HMA achieves better visual fidelity and controllability than previous robotic video generation models while running 15 times faster in the real world. After post-training, this model can be used as a video simulator driven by low-level action inputs for evaluating policies and generating synthetic data. See this https URL for more information.
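The abstract's masked autoregression over quantized tokens can be illustrated with a MaskGIT-style iterative unmasking loop: the next frame starts fully masked and the most confident token predictions are revealed over a few passes. This is a generic sketch of that decoding scheme, not HMA's implementation; the toy transformer, codebook size, and token counts are assumptions.

```python
import torch
import torch.nn as nn

class MaskedFramePredictor(nn.Module):
    """Predicts codebook logits for every (possibly masked) token of the next frame."""
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab + 1, dim)   # last index is the [MASK] token
        self.mask_id = vocab
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, context_tokens, next_tokens):
        x = self.embed(torch.cat([context_tokens, next_tokens], dim=1))
        return self.head(self.backbone(x))[:, context_tokens.size(1):]

@torch.no_grad()
def masked_autoregressive_decode(model, context_tokens, tokens_per_frame=64, steps=4):
    """MaskGIT-style decoding: reveal the most confident tokens first, a few per pass."""
    B = context_tokens.size(0)
    frame = torch.full((B, tokens_per_frame), model.mask_id, dtype=torch.long)
    for s in range(steps):
        logits = model(context_tokens, frame)
        probs, pred = logits.softmax(-1).max(-1)
        still_masked = frame == model.mask_id
        k = int(tokens_per_frame * (s + 1) / steps) - int(tokens_per_frame * s / steps)
        conf = probs.masked_fill(~still_masked, -1.0)   # never re-reveal finished tokens
        idx = conf.topk(k, dim=1).indices
        frame.scatter_(1, idx, pred.gather(1, idx))
    return frame

model = MaskedFramePredictor()
ctx = torch.randint(0, 1024, (2, 64))                   # tokens of one observed frame
print(masked_autoregressive_decode(model, ctx).shape)   # torch.Size([2, 64])
```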
https://arxiv.org/abs/2502.04296
Temporal sequence modeling is the foundation of video prediction systems, real-time forecasting, and anomaly detection applications. Achieving accurate predictions with efficient resource consumption remains an open problem in temporal sequence modeling. We introduce the Multi-Attention Unit (MAUCell), which combines Generative Adversarial Networks (GANs) and spatio-temporal attention mechanisms to improve video frame prediction. Our approach employs three types of attention to capture intricate motion sequences; dynamically combining their outputs lets the model achieve high decision accuracy and output quality while remaining computationally efficient. The GAN component makes the generated frames appear more lifelike, so the framework produces output sequences that mimic real-world footage. The design balances temporal continuity with spatial accuracy to deliver reliable video prediction. A comprehensive evaluation that pairs the perceptual LPIPS metric with the classic MSE, MAE, SSIM, and PSNR measures shows improvements over contemporary approaches on benchmark tests with the Moving MNIST, KTH Action, and CASIA-B (Preprocessed) datasets. Our analysis also indicates that MAUCell is promising with respect to runtime requirements. These findings demonstrate how GANs combined with attention mechanisms can yield better video sequence prediction.
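The abstract does not specify the three attention designs, so the following is only an illustrative stand-in: one plausible way to compute spatial, channel, and temporal attention maps inside a recurrent cell and blend them with learned mixing weights, without the GAN loss.

```python
import torch
import torch.nn as nn

class MultiAttentionUnit(nn.Module):
    """Combines spatial, channel, and temporal attention with learned mixing weights."""
    def __init__(self, channels=32):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=7, padding=3)
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1))
        self.temporal = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.mix = nn.Parameter(torch.zeros(3))   # softmax-normalised mixing weights

    def forward(self, x_t, h_prev):
        att_s = torch.sigmoid(self.spatial(x_t)) * x_t                               # where to look
        att_c = torch.sigmoid(self.channel(x_t)) * x_t                               # which features matter
        att_t = torch.sigmoid(self.temporal(torch.cat([x_t, h_prev], 1))) * h_prev   # motion cue from the past
        w = torch.softmax(self.mix, dim=0)
        return w[0] * att_s + w[1] * att_c + w[2] * att_t

cell = MultiAttentionUnit()
x, h = torch.randn(2, 32, 16, 16), torch.randn(2, 32, 16, 16)
print(cell(x, h).shape)   # torch.Size([2, 32, 16, 16])
```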
https://arxiv.org/abs/2501.16997
Spatiotemporal modeling of real-world data poses a challenging problem due to inherent high dimensionality, measurement noise, and expensive data collection procedures. In this paper, we present Sparse Identification of Nonlinear Dynamics with SHallow REcurrent Decoder networks (SINDy-SHRED), a method to jointly solve the sensing and model identification problems with simple implementation, efficient computation, and robust performance. SINDy-SHRED uses Gated Recurrent Units (GRUs) to model the temporal sequence of sensor measurements along with a shallow decoder network to reconstruct the full spatiotemporal field from the latent state space using only a few available sensors. Our proposed algorithm introduces a SINDy-based regularization; beginning with an arbitrary latent state space, the dynamics of the latent space progressively converge to a SINDy-class functional, provided the projection remains within the set. Restricting SINDy to a linear model yields a Koopman-SHRED architecture, which enforces linear latent-space dynamics. We conduct a systematic experimental study including synthetic PDE data, real-world sensor measurements of sea surface temperature, and direct video data. With no explicit encoder, SINDy-SHRED and Koopman-SHRED enable efficient training with minimal hyperparameter tuning and laptop-level computing; furthermore, they generalize robustly across a variety of applications with minimal to no hyperparameter adjustment. Finally, the interpretable SINDy and Koopman models of latent state dynamics enable accurate long-term video predictions, achieving state-of-the-art performance and outperforming all baseline methods considered, including Convolutional LSTM, PredRNN, ResNet, and SimVP.
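A minimal sketch of the ingredients the abstract names: a GRU over a few sensor traces, a shallow decoder to the full field, and a SINDy-style regularizer that fits latent finite-difference derivatives to a small polynomial library. The library choice, dimensions, and sparsity weight are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn

class SHRED(nn.Module):
    """GRU over sparse sensor sequences plus a shallow decoder to the full field."""
    def __init__(self, n_sensors=3, latent_dim=8, field_dim=1024):
        super().__init__()
        self.gru = nn.GRU(n_sensors, latent_dim, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, field_dim))

    def forward(self, sensor_seq):
        z, _ = self.gru(sensor_seq)        # (batch, time, latent_dim)
        return self.decoder(z), z

def sindy_library(z):
    """Candidate terms [1, z, z^2]: a tiny polynomial library for illustration."""
    return torch.cat([torch.ones_like(z[..., :1]), z, z ** 2], dim=-1)

def sindy_regularizer(z, xi, dt=0.1):
    """Penalise deviation of latent finite-difference derivatives from the SINDy model."""
    dz_dt = (z[:, 1:] - z[:, :-1]) / dt
    theta = sindy_library(z[:, :-1])
    return ((dz_dt - theta @ xi) ** 2).mean() + 1e-3 * xi.abs().mean()   # sparsity term

model = SHRED()
xi = nn.Parameter(torch.zeros(1 + 2 * 8, 8))   # library size x latent dim
sensors, field = torch.randn(4, 20, 3), torch.randn(4, 20, 1024)
recon, z = model(sensors)
loss = ((recon - field) ** 2).mean() + sindy_regularizer(z, xi)
loss.backward()
```

Restricting the library to the linear term alone would give the Koopman-SHRED variant mentioned in the abstract.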
https://arxiv.org/abs/2501.13329
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than on masked ones (as in Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame-conditioned video prediction. To address issues such as exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
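The CTF-versus-MTF distinction boils down to what the past context looks like during training. The toy below builds per-frame training pairs either way; it works on raw patch features and masks by zeroing, which is a simplification of the quantized-token setup the paper presumably uses, so treat it only as a reading of the abstract.

```python
import torch

def make_training_inputs(frames, mask_ratio=0.5, complete_teacher_forcing=True):
    """Build per-frame (context, target) pairs for masked next-frame prediction.

    frames: (T, N, D) -- T frames, each a set of N patch tokens of dim D (toy example).
    With CTF the context is the *unmasked* past frames; with MTF the past frames are
    masked with the same ratio as the target frame.
    """
    T, N, D = frames.shape
    mask = torch.rand(T, N) < mask_ratio                 # True = hidden token
    masked_frames = frames.masked_fill(mask.unsqueeze(-1), 0.0)
    pairs = []
    for t in range(1, T):
        past = frames[:t] if complete_teacher_forcing else masked_frames[:t]
        target_in = masked_frames[t]                     # partially masked current frame
        pairs.append((past, target_in, frames[t], mask[t]))
    return pairs

frames = torch.randn(4, 16, 32)
ctf_pairs = make_training_inputs(frames, complete_teacher_forcing=True)
print(len(ctf_pairs), ctf_pairs[0][0].shape)             # 3 torch.Size([1, 16, 32])
```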
https://arxiv.org/abs/2501.12389
Video prediction is a crucial task for intelligent agents such as robots and autonomous vehicles, since it enables them to anticipate and act early on time-critical incidents. State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects. This is challenging and potentially sub-optimal, as every object in a dynamic scene has its own pattern of movement, typically somewhat independent of the others. In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models. We conduct detailed and carefully controlled experiments on both synthetic and real-world datasets; our results show that decomposing a dynamic scene leads to higher-quality predictions than models of similar capacity that lack such decomposition.
https://arxiv.org/abs/2501.10562
Recent advancements in robotics have focused on developing generalist policies capable of performing multiple tasks. Typically, these policies use pre-trained vision encoders to capture crucial information from current observations. However, previous vision encoders, trained with two-image contrastive learning or single-image reconstruction, cannot fully capture the sequential information essential for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences, exhibiting a good understanding of physical dynamics. Motivated by the strong visual prediction capabilities of VDMs, we hypothesize that they inherently possess visual representations that reflect the evolution of the physical world, which we term predictive visual representations. Building on this hypothesis, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. To further enhance these representations, we incorporate diverse human and robotic manipulation datasets, employing unified video-generation training objectives. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks. Notably, it achieves a 28.1% relative improvement on the CALVIN ABC-D benchmark compared to the previous state of the art and delivers a 28.8% increase in success rates on complex real-world dexterous manipulation tasks.
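The overall pattern is straightforward to sketch: extract features from a (normally frozen) video prediction backbone and feed them, together with proprioception, into an action head. The small Conv3d network below is only a stand-in for a real video diffusion model, and the dimensions and policy head are assumptions.

```python
import torch
import torch.nn as nn

class PredictiveRepresentationEncoder(nn.Module):
    """Stand-in for a frozen video-prediction backbone whose intermediate
    features serve as 'predictive visual representations'."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )

    def forward(self, video):            # video: (B, 3, T, H, W)
        return self.backbone(video)

class VideoConditionedPolicy(nn.Module):
    """Maps predictive representations (plus proprioception) to an action."""
    def __init__(self, feat_dim=256, proprio_dim=7, action_dim=7):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim + proprio_dim, 128), nn.ReLU(),
                                  nn.Linear(128, action_dim))

    def forward(self, feats, proprio):
        return self.head(torch.cat([feats, proprio], dim=-1))

encoder, policy = PredictiveRepresentationEncoder(), VideoConditionedPolicy()
video, proprio = torch.randn(2, 3, 8, 64, 64), torch.randn(2, 7)
with torch.no_grad():                    # the backbone is treated as frozen
    feats = encoder(video)
action = policy(feats, proprio)
print(action.shape)                      # torch.Size([2, 7])
```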
https://arxiv.org/abs/2412.14803
The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method named STIV. Our framework integrates the image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation. With comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrates strong performance despite its simple design. An 8.7B model at 512 resolution achieves 83.1 on VBench T2V, surpassing leading open- and closed-source models such as CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.
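The two conditioning mechanisms named in the abstract can be sketched in a few lines. The cascaded guidance form and the guidance weights below are common choices rather than STIV's documented formulation; the abstract only states that image conditioning enters via frame replacement and text conditioning via joint image-text classifier-free guidance.

```python
import torch

def joint_image_text_cfg(eps_uncond, eps_img, eps_img_text, w_img=7.5, w_text=7.5):
    """One common cascaded form of joint classifier-free guidance: first guide
    towards the image condition, then towards the text condition on top of it."""
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_text * (eps_img_text - eps_img))

def apply_frame_replacement(noisy_latents, image_latent):
    """Frame replacement: overwrite the first latent frame with the clean encoded
    conditioning image at every denoising step. Latents are (B, C, T, H, W)."""
    noisy_latents = noisy_latents.clone()
    noisy_latents[:, :, 0] = image_latent
    return noisy_latents

# three denoiser passes: unconditional / image-conditioned / image+text-conditioned
eps_uncond, eps_img, eps_img_text = torch.randn(3, 2, 4, 8, 16, 16).unbind(0)
guided = joint_image_text_cfg(eps_uncond, eps_img, eps_img_text)
latents = apply_frame_replacement(torch.randn(2, 4, 8, 16, 16), torch.randn(2, 4, 16, 16))
print(guided.shape, latents.shape)
```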
https://arxiv.org/abs/2412.07730
Multi-step prediction models, such as diffusion and rectified flow models, have emerged as state-of-the-art solutions for generation tasks. However, these models exhibit higher latency in sampling new frames compared to single-step methods. This latency issue becomes a significant bottleneck when adapting such methods for video prediction tasks, given that a typical 60-second video comprises approximately 1.5K frames. In this paper, we propose a novel approach to modeling the multi-step process, aimed at alleviating latency constraints and facilitating the adaptation of such processes for video prediction tasks. Our approach not only reduces the number of sample steps required to predict the next frame but also minimizes computational demands by reducing the model size to one-third of the original size. We evaluate our method on standard video prediction datasets, including KTH, BAIR action robot, Human3.6M and UCF101, demonstrating its efficacy in achieving state-of-the-art performance on these benchmarks.
https://arxiv.org/abs/2412.05633
Diffusion models have made significant strides in image generation, mastering tasks such as unconditional image synthesis, text-image translation, and image-to-image conversion. However, their capability falls short in the realm of video prediction, mainly because they treat videos as a collection of independent images and rely on external constraints such as temporal attention mechanisms to enforce temporal coherence. In our paper, we introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. We also report a 75% reduction in the sampling steps required to sample a new frame, making our framework more efficient at inference time. Through extensive experimentation, we establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101. Navigate to the project page at this https URL for video results.
https://arxiv.org/abs/2412.04929
Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine. Due to inherent uncertainties, existing prediction models often struggle with the complexity of motion dynamics and occlusions. In this paper, we propose a novel stochastic long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy. By integrating frames generated through forward and backward warpings, our approach effectively compensates for the weaknesses of each technique, improving the prediction accuracy and realism of moving regions in videos while also addressing uncertainty by making stochastic predictions that account for various motions. Furthermore, considering real-time predictions, we introduce a MobileNet-based lightweight architecture into our model. Our model, called SVPHW, achieves state-of-the-art performance on two benchmark datasets.
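The hybrid warping idea can be sketched as blending two warped candidates with a per-pixel confidence mask. For brevity, both candidates below are produced by grid_sample-based backward sampling (true forward splatting is omitted), and the occlusion/confidence mask is assumed to be predicted elsewhere; this is an illustration of the blending step, not SVPHW's actual warping scheme.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp `frame` (B, C, H, W) with a dense flow field (B, 2, H, W) via grid_sample."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()          # (H, W, 2) in pixel coords
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)   # displaced sampling locations
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1         # normalise to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(frame, grid, align_corners=True)

def hybrid_warp(prev_frame, next_frame, fwd_flow, bwd_flow, confidence_mask):
    """Blend the two warped candidates; the mask (B, 1, H, W) in [0, 1] marks
    where the forward candidate is considered reliable."""
    cand_fwd = backward_warp(prev_frame, fwd_flow)
    cand_bwd = backward_warp(next_frame, bwd_flow)
    return confidence_mask * cand_fwd + (1 - confidence_mask) * cand_bwd

prev_f, next_f = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
flow = torch.zeros(1, 2, 32, 32)
mask = torch.full((1, 1, 32, 32), 0.5)
print(hybrid_warp(prev_f, next_f, flow, flow, mask).shape)   # torch.Size([1, 3, 32, 32])
```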
https://arxiv.org/abs/2412.03061
Accurate forecasts of distributed solar generation are necessary to reduce the negative impacts resulting from the increased uptake of distributed solar photovoltaic (PV) systems. However, the high variability of solar generation over short time intervals (seconds to minutes) caused by cloud movement makes this forecasting task difficult. To address this, using cloud images, which capture the second-to-second changes in cloud cover affecting solar generation, has shown promise. Recently, deep neural networks with "attention" that focus on important regions of an image have been applied with success in many computer vision applications, but their use for forecasting cloud movement has not yet been extensively explored. In this work, we propose an attention-based convolutional long short-term memory network to forecast cloud movement, and we also apply an existing self-attention-based method previously proposed for video prediction to the same task. We investigate and discuss the impact of cloud forecasts from attention-based methods on forecasting distributed solar generation, compared to cloud forecasts from non-attention-based methods. We further provide insights into the different solar forecast performances that can be achieved for high- and low-altitude clouds. We find that for clouds at high altitudes, the cloud predictions obtained using attention-based methods yield solar forecast skill score improvements of 5.86% or more compared to non-attention-based methods.
https://arxiv.org/abs/2411.10921
Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and contain a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder the use of videos for RL pre-training. To address this challenge, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. By adopting video prediction as a pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations. The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotic visual control tasks and verify that PVDR is an effective form of pre-training with videos that promotes policy learning.
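The CVAE-based video prediction objective can be illustrated with a tiny conditional VAE over frame features: a posterior over a latent given context and future, a learned prior given context alone, and a decoder that reconstructs the future from context and latent. This MLP version (rather than a Transformer) and all dimensions are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class VideoCVAE(nn.Module):
    """Toy conditional VAE for next-frame-feature prediction (the pre-training task)."""
    def __init__(self, feat_dim=128, z_dim=16):
        super().__init__()
        self.posterior = nn.Linear(2 * feat_dim, 2 * z_dim)   # q(z | context, future)
        self.prior = nn.Linear(feat_dim, 2 * z_dim)           # p(z | context)
        self.decoder = nn.Sequential(nn.Linear(feat_dim + z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def forward(self, context, future):
        mu_q, logvar_q = self.posterior(torch.cat([context, future], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(context).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()   # reparameterisation
        recon = self.decoder(torch.cat([context, z], -1))
        # KL(q || p) between two diagonal Gaussians
        kl = 0.5 * ((logvar_p - logvar_q)
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1).mean()
        return recon, kl

model = VideoCVAE()
context, future = torch.randn(8, 128), torch.randn(8, 128)
recon, kl = model(context, future)
loss = ((recon - future) ** 2).mean() + 0.1 * kl
loss.backward()
```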
https://arxiv.org/abs/2411.03169
Temporal prediction is inherently uncertain, but representing the ambiguity in natural image sequences is a challenging high-dimensional probabilistic inference problem. For natural scenes, the curse of dimensionality renders explicit density estimation statistically and computationally intractable. Here, we describe an implicit regression-based framework for learning and sampling the conditional density of the next frame in a video given previously observed frames. We show that sequence-to-image deep networks trained on a simple resilience-to-noise objective function extract adaptive representations for temporal prediction. Synthetic experiments demonstrate that this score-based framework can handle occlusion boundaries: unlike classical methods that average over bifurcating temporal trajectories, it chooses among likely trajectories, selecting more probable options with higher frequency. Furthermore, analysis of networks trained on natural image sequences reveals that the representation automatically weights predictive evidence by its reliability, which is a hallmark of statistical inference.
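One plausible reading of the "resilience-to-noise objective" is a denoising regression: predict the clean next frame from past frames plus a noise-corrupted guess, and then apply the trained network iteratively at sampling time. The tiny convolutional network and the fixed-point style sampling loop below are illustrative assumptions, not the paper's actual procedure.

```python
import torch
import torch.nn as nn

class NextFrameDenoiser(nn.Module):
    """Predicts a clean next frame from past frames plus a noisy next-frame guess."""
    def __init__(self, channels=1, context=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels * (context + 1), 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, past, noisy_next):
        return self.net(torch.cat([past.flatten(1, 2), noisy_next], dim=1))

model = NextFrameDenoiser()
past = torch.rand(4, 2, 1, 32, 32)                     # two observed frames
target = torch.rand(4, 1, 32, 32)                      # true next frame
noisy = target + 0.5 * torch.randn_like(target)
loss = ((model(past, noisy) - target) ** 2).mean()     # resilience-to-noise regression
loss.backward()

# at inference, the trained denoiser can be applied iteratively from pure noise,
# each pass nudging the sample towards a likely next frame (a crude score-ascent sketch)
with torch.no_grad():
    sample = torch.randn(4, 1, 32, 32)
    for _ in range(10):
        sample = 0.5 * sample + 0.5 * model(past, sample)
```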
https://arxiv.org/abs/2411.00842
We introduce the motion graph, a novel approach to the video prediction problem, which predicts future video frames from limited past data. The motion graph transforms patches of video frames into interconnected graph nodes to comprehensively describe the spatial-temporal relationships among them. This representation overcomes the limitations of existing motion representations such as image differences, optical flow, and motion matrices, which either fall short in capturing complex motion patterns or suffer from excessive memory consumption. We further present a video prediction pipeline empowered by the motion graph, exhibiting substantial performance improvements and cost reductions. Experiments on various datasets, including UCF Sports, KITTI, and Cityscapes, highlight the strong representative ability of the motion graph. On UCF Sports in particular, our method matches or outperforms the SOTA methods while reducing model size by 78% and GPU memory utilization by 47%.
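The patches-as-nodes idea can be sketched without any graph library: split a frame into patches, connect each patch node to its spatial neighbours, and run one round of message passing. The k-nearest-neighbour connectivity and single-layer update below are illustrative assumptions, not the paper's graph construction.

```python
import torch
import torch.nn as nn

def frame_to_patch_nodes(frame, patch=8):
    """Split a frame (C, H, W) into non-overlapping patches; each patch becomes a node feature."""
    C, H, W = frame.shape
    patches = frame.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)

def knn_edges(node_xy, k=4):
    """Connect each patch node to its k nearest spatial neighbours."""
    dist = torch.cdist(node_xy, node_xy)
    return dist.topk(k + 1, largest=False).indices[:, 1:]             # drop self-edges

class MotionGraphStep(nn.Module):
    """One round of message passing over patch nodes (a toy stand-in for the motion graph)."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)

    def forward(self, nodes, neighbours):
        gathered = nodes[neighbours]                                  # (N, k, dim)
        messages = self.msg(torch.cat(
            [nodes.unsqueeze(1).expand_as(gathered), gathered], -1)).mean(1)
        return nodes + torch.relu(messages)

frame = torch.rand(3, 64, 64)
nodes = frame_to_patch_nodes(frame)                                   # (64, 192)
grid = torch.stack(torch.meshgrid(torch.arange(8), torch.arange(8), indexing="ij"), -1)
edges = knn_edges(grid.reshape(-1, 2).float())
out = MotionGraphStep(nodes.size(1))(nodes, edges)
print(out.shape)                                                      # torch.Size([64, 192])
```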
https://arxiv.org/abs/2410.22288
Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively "glue together" language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.
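The subgoal-filtering interface can be sketched as a small scorer that rates each generated subgoal against the current observation and keeps only the candidate with the highest predicted task progress. The scorer architecture and the progress-score formulation are assumptions for illustration, not GHIL-Glue's actual components.

```python
import torch
import torch.nn as nn

class SubgoalFilter(nn.Module):
    """Scores generated subgoal images by predicted task progress given the current observation."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(6, 16, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, observation, subgoals):
        # observation: (B, 3, H, W); subgoals: (B, K, 3, H, W)
        B, K = subgoals.shape[:2]
        obs = observation.unsqueeze(1).expand(-1, K, -1, -1, -1)
        pairs = torch.cat([obs, subgoals], dim=2).flatten(0, 1)   # (B*K, 6, H, W)
        return self.encoder(pairs).view(B, K)                     # progress scores

def select_subgoal(filter_net, observation, subgoals):
    """Keep the candidate with the highest predicted progress; discard the rest."""
    scores = filter_net(observation, subgoals)
    best = scores.argmax(dim=1)
    return subgoals[torch.arange(subgoals.size(0)), best]

obs = torch.rand(2, 3, 64, 64)
candidates = torch.rand(2, 5, 3, 64, 64)   # five generated subgoal images per scene
goal = select_subgoal(SubgoalFilter(), obs, candidates)
print(goal.shape)                          # torch.Size([2, 3, 64, 64])
```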
https://arxiv.org/abs/2410.20018
Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at this https URL.
https://arxiv.org/abs/2410.18912
Unsupervised object-centric learning from videos is a promising approach towards learning compositional representations that can be applied to various downstream tasks, such as prediction and reasoning. Recently, it was shown that pretrained Vision Transformers (ViTs) can be useful to learn object-centric representations on real-world video datasets. However, while these approaches succeed at extracting objects from the scenes, the slot-based representations fail to maintain temporal consistency across consecutive frames in a video, i.e. the mapping of objects to slots changes across the video. To address this, we introduce Conditional Autoregressive Slot Attention (CA-SA), a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks. Leveraging an autoregressive prior network to condition representations on previous timesteps and a novel consistency loss function, CA-SA predicts future slot representations and imposes consistency across frames. We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks, such as video prediction and visual question-answering tasks.
https://arxiv.org/abs/2410.15728
World models integrate raw data from various modalities, such as images and language, to simulate comprehensive interactions in the world, and thereby play crucial roles in fields like mixed reality and robotics. Yet applying a world model for accurate video prediction is quite challenging due to the complex and dynamic intentions of the various scenes encountered in practice. In this paper, inspired by the human rethinking process, we decompose complex video prediction into four meta-tasks that enable the world model to handle this issue in a more fine-grained manner. Alongside these tasks, we introduce a new benchmark named the Embodied Video Anticipation Benchmark (EVA-Bench) to provide a well-rounded evaluation. EVA-Bench focuses on evaluating the video prediction ability of human and robot actions, presenting significant challenges for both the language model and the generation model. Targeting embodied video prediction, we propose the Embodied Video Anticipator (EVA), a unified framework aimed at video understanding and generation. EVA integrates a video generation model with a visual language model, effectively combining reasoning capabilities with high-quality generation. Moreover, to enhance the generalization of our framework, we tailor-design a multi-stage pretraining paradigm that adaptively ensembles LoRA to produce high-fidelity results. Extensive experiments on EVA-Bench highlight the potential of EVA to significantly improve performance in embodied scenes, paving the way for large-scale pre-trained models in real-world prediction tasks.
https://arxiv.org/abs/2410.15461
The increase in Arctic marine activity due to rapid warming and significant sea ice loss necessitates highly reliable, short-term sea ice forecasts to ensure maritime safety and operational efficiency. In this work, we present a novel data-driven approach for sea ice condition forecasting in the Gulf of Ob, leveraging sequences of radar images from Sentinel-1, weather observations, and GLORYS forecasts. Our approach integrates advanced video prediction models, originally developed for vision tasks, with domain-specific data preprocessing and augmentation techniques tailored to the unique challenges of Arctic sea ice dynamics. Central to our methodology is the use of uncertainty quantification to assess the reliability of predictions, ensuring robust decision-making in safety-critical applications. Furthermore, we propose a confidence-based model mixture mechanism that enhances forecast accuracy and model robustness, crucial for reliable operations in volatile Arctic environments. Our results demonstrate substantial improvements over baseline approaches, underscoring the importance of uncertainty quantification and specialized data handling for effective and safe operations and reliable forecasting.
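The confidence-based model mixture can be sketched as an inverse-variance weighted blend of per-model forecasts on a common grid. This is one standard way to combine forecasts with uncertainty estimates and is offered only as an illustration of the idea, under the assumption that each model provides a per-pixel uncertainty.

```python
import torch

def confidence_weighted_mixture(predictions, variances):
    """Blend per-model forecasts using inverse-variance (confidence) weights.

    predictions, variances: (num_models, ...) tensors on the same spatial grid.
    """
    weights = 1.0 / (variances + 1e-6)
    weights = weights / weights.sum(dim=0, keepdim=True)
    return (weights * predictions).sum(dim=0)

# two toy sea-ice concentration forecasts on a 4x4 grid, with per-pixel uncertainty
preds = torch.stack([torch.full((4, 4), 0.8), torch.full((4, 4), 0.2)])
vars_ = torch.stack([torch.full((4, 4), 0.01), torch.full((4, 4), 0.09)])
print(confidence_weighted_mixture(preds, vars_))   # lands closer to the more confident model
```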
https://arxiv.org/abs/2410.19782