Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine. Due to inherent uncertainties, existing prediction models often struggle with the complexity of motion dynamics and occlusions. In this paper, we propose a novel stochastic long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy. By integrating frames generated through forward and backward warping, our approach compensates for the weaknesses of each technique, improving the prediction accuracy and realism of moving regions while also addressing uncertainty through stochastic predictions that account for various motions. Furthermore, to enable real-time prediction, we introduce a lightweight MobileNet-based architecture into our model. Our model, called SVPHW, achieves state-of-the-art performance on two benchmark datasets.
https://arxiv.org/abs/2412.03061
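A minimal sketch of the hybrid warping idea described above, assuming dense optical flow is available as input; the blend mask, tensor shapes, and the cheap forward-warp approximation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Sample `frame` at locations displaced by `flow` (B, 2, H, W)."""
    b, _, h, w = frame.shape
    # Build a normalized sampling grid in [-1, 1].
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def hybrid_warp(prev_frame, fwd_flow, bwd_flow, blend_mask):
    """Blend a cheap forward-warp approximation with a backward warp."""
    # Approximate forward warping by backward-sampling with the negated flow;
    # a full implementation would use splatting instead.
    fwd_warped = backward_warp(prev_frame, -fwd_flow)
    bwd_warped = backward_warp(prev_frame, bwd_flow)
    return blend_mask * fwd_warped + (1 - blend_mask) * bwd_warped

# Toy usage with random tensors.
frame = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
mask = torch.full((1, 1, 64, 64), 0.5)
out = hybrid_warp(frame, flow, flow, mask)   # (1, 3, 64, 64)
```

In a real model the blend mask would typically be predicted by a small network rather than supplied by hand.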
Accurate forecasts of distributed solar generation are necessary to reduce negative impacts resulting from the increased uptake of distributed solar photovoltaic (PV) systems. However, the high variability of solar generation over short time intervals (seconds to minutes) caused by cloud movement makes this forecasting task difficult. To address this, using cloud images, which capture the second-to-second changes in cloud cover affecting solar generation, has shown promise. Recently, deep neural networks with "attention" that focus on important regions of an image have been applied with success in many computer vision applications. However, their use for forecasting cloud movement has not yet been extensively explored. In this work, we propose an attention-based convolutional long short-term memory network to forecast cloud movement and also apply an existing self-attention-based method, previously proposed for video prediction, to the same task. We investigate and discuss the impact of cloud forecasts from attention-based methods on forecasting distributed solar generation, compared to cloud forecasts from non-attention-based methods. We further provide insights into the different solar forecast performances that can be achieved for high and low altitude clouds. We find that for clouds at high altitudes, the cloud predictions obtained using attention-based methods result in solar forecast skill score improvements of 5.86% or more compared to non-attention-based methods.
https://arxiv.org/abs/2411.10921
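A minimal sketch of an attention-augmented ConvLSTM cell in the spirit of the approach above; where the attention gate is applied, the kernel sizes, and the channel counts are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class AttentionConvLSTMCell(nn.Module):
    """ConvLSTM cell with a simple spatial-attention gate on the hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        # One conv produces the input, forget, output, and candidate gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=pad)
        # Attention map over spatial positions, computed from input and hidden state.
        self.attn = nn.Conv2d(in_ch + hid_ch, 1, k, padding=pad)

    def forward(self, x, state):
        h, c = state
        # Spatial attention re-weights the hidden state before gating.
        a = torch.sigmoid(self.attn(torch.cat([x, h], dim=1)))
        h = h * a
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

# Roll the cell over a toy sequence of 8 "cloud image" frames.
cell = AttentionConvLSTMCell(in_ch=1, hid_ch=16)
seq = torch.rand(4, 8, 1, 32, 32)                      # (batch, time, C, H, W)
h = torch.zeros(4, 16, 32, 32)
c = torch.zeros(4, 16, 32, 32)
for t in range(seq.shape[1]):
    h, c = cell(seq[:, t], (h, c))
```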
Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and contain a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder their use for RL pre-training. To address this challenge, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. By adopting video prediction as a pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations that capture the prior knowledge of visual dynamics in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective approach to pre-training with videos for promoting policy learning.
https://arxiv.org/abs/2411.03169
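A minimal sketch of a conditional VAE objective for learning dynamics representations over pre-computed frame embeddings; the paper uses a Transformer-based CVAE, whereas the linear layers, dimensions, and the learned conditional prior here are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCVAE(nn.Module):
    """Conditional VAE over frame embeddings: context -> future."""
    def __init__(self, emb_dim=128, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(2 * emb_dim, 2 * z_dim)     # posterior q(z | context, future)
        self.prior = nn.Linear(emb_dim, 2 * z_dim)       # prior p(z | context)
        self.dec = nn.Linear(emb_dim + z_dim, emb_dim)   # decoder p(future | context, z)

    def forward(self, context, future):
        mu_q, logvar_q = self.enc(torch.cat([context, future], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(context).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()   # reparameterization
        recon = self.dec(torch.cat([context, z], -1))
        # KL divergence between the posterior and the learned conditional prior.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1).mean()
        return F.mse_loss(recon, future) + kl

model = TinyCVAE()
loss = model(torch.randn(16, 128), torch.randn(16, 128))
loss.backward()
```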
Temporal prediction is inherently uncertain, but representing the ambiguity in natural image sequences is a challenging high-dimensional probabilistic inference problem. For natural scenes, the curse of dimensionality renders explicit density estimation statistically and computationally intractable. Here, we describe an implicit regression-based framework for learning and sampling the conditional density of the next frame in a video given previously observed frames. We show that sequence-to-image deep networks trained on a simple resilience-to-noise objective function extract adaptive representations for temporal prediction. Synthetic experiments demonstrate that this score-based framework can handle occlusion boundaries: unlike classical methods that average over bifurcating temporal trajectories, it chooses among likely trajectories, selecting more probable options with higher frequency. Furthermore, analysis of networks trained on natural image sequences reveals that the representation automatically weights predictive evidence by its reliability, which is a hallmark of statistical inference.
https://arxiv.org/abs/2411.00842
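A minimal sketch of a denoising-style ("resilience-to-noise") training step for next-frame prediction, assuming a generic convolutional predictor; the noise schedule, architecture, and tensor shapes are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A generic sequence-to-image predictor: past frames stacked along channels.
net = nn.Sequential(
    nn.Conv2d(3 * 4 + 3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

def training_step(past, target):
    """past: (B, 4, 3, H, W) observed frames; target: (B, 3, H, W) next frame."""
    sigma = torch.rand(target.shape[0], 1, 1, 1) * 0.5      # random noise level
    noisy = target + sigma * torch.randn_like(target)       # corrupt the next frame
    inp = torch.cat([past.flatten(1, 2), noisy], dim=1)     # condition on past + noisy frame
    denoised = net(inp)
    loss = F.mse_loss(denoised, target)                     # resilience-to-noise objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

training_step(torch.rand(2, 4, 3, 32, 32), torch.rand(2, 3, 32, 32))
```

The trained denoiser implicitly encodes the score of the conditional density, which is what allows sampling distinct likely trajectories instead of their average.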
We introduce the motion graph, a novel approach to the video prediction problem, which predicts future video frames from limited past data. The motion graph transforms patches of video frames into interconnected graph nodes to comprehensively describe the spatial-temporal relationships among them. This representation overcomes the limitations of existing motion representations such as image differences, optical flow, and motion matrices, which either fall short in capturing complex motion patterns or suffer from excessive memory consumption. We further present a video prediction pipeline empowered by the motion graph, exhibiting substantial performance improvements and cost reductions. Experiments on various datasets, including UCF Sports, KITTI, and Cityscapes, highlight the strong representative ability of the motion graph. Especially on UCF Sports, our method matches or outperforms the SOTA methods while reducing model size by 78% and GPU memory utilization by 47%.
https://arxiv.org/abs/2410.22288
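A minimal sketch of turning frame patches into graph nodes with temporal edges, in the spirit of the motion graph described above; the patch size, the k-nearest-neighbor edge rule, and the feature-space distance are assumptions, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def frames_to_graph(frames, patch=8, k=4):
    """Turn each frame into patch nodes and connect every node to its k nearest
    nodes (in feature space) in the next frame.

    frames: (T, C, H, W) -> node features (T, N, C*patch*patch) and per-step edges.
    """
    T = frames.shape[0]
    nodes = F.unfold(frames, kernel_size=patch, stride=patch)   # (T, C*p*p, N)
    nodes = nodes.transpose(1, 2)                               # (T, N, C*p*p)
    edges = []
    for t in range(T - 1):
        # Pairwise distances between patches of frame t and frame t+1.
        d = torch.cdist(nodes[t], nodes[t + 1])                 # (N, N)
        nbrs = d.topk(k, largest=False).indices                 # (N, k)
        src = torch.arange(nodes.shape[1]).repeat_interleave(k)
        edges.append(torch.stack([src, nbrs.reshape(-1)]))      # (2, N*k)
    return nodes, edges

nodes, edges = frames_to_graph(torch.rand(5, 3, 64, 64))
```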
Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively "glue together" language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. In extensive experiments in both simulated and real environments, we find that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, setting a new state of the art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies on 3 of 4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.
https://arxiv.org/abs/2410.20018
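A minimal sketch of the subgoal-filtering idea: score each generated subgoal against the current observation with a task-progress classifier and discard those that do not indicate progress. The `ProgressClassifier`, its inputs, and the threshold are hypothetical stand-ins, not GHIL-Glue's actual components.

```python
import torch
import torch.nn as nn

class ProgressClassifier(nn.Module):
    """Scores whether a candidate subgoal image represents progress from the
    current observation toward the task (hypothetical stand-in model)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, obs, subgoal):
        return self.net(torch.cat([obs, subgoal], dim=1)).squeeze(-1)

def filter_subgoals(obs, candidate_subgoals, classifier, threshold=0.0):
    """Drop generated subgoals the classifier does not score as task progress."""
    scores = classifier(obs.expand(len(candidate_subgoals), -1, -1, -1),
                        candidate_subgoals)
    keep = scores > threshold
    return candidate_subgoals[keep], scores[keep]

clf = ProgressClassifier()
obs = torch.rand(1, 3, 64, 64)
cands = torch.rand(8, 3, 64, 64)          # e.g. sampled from a video/image generator
kept, scores = filter_subgoals(obs, cands, clf)
```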
Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at this https URL.
https://arxiv.org/abs/2410.18912
Unsupervised object-centric learning from videos is a promising approach towards learning compositional representations that can be applied to various downstream tasks, such as prediction and reasoning. Recently, it was shown that pretrained Vision Transformers (ViTs) can be useful for learning object-centric representations on real-world video datasets. However, while these approaches succeed at extracting objects from the scenes, the slot-based representations fail to maintain temporal consistency across consecutive frames in a video, i.e., the mapping of objects to slots changes across the video. To address this, we introduce Conditional Autoregressive Slot Attention (CA-SA), a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks. Leveraging an autoregressive prior network to condition representations on previous timesteps and a novel consistency loss function, CA-SA predicts future slot representations and imposes consistency across frames. We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks, such as video prediction and visual question-answering.
https://arxiv.org/abs/2410.15728
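A minimal sketch of conditioning slot representations on the previous timestep with an autoregressive prior and penalizing inconsistency, in the spirit of CA-SA; the prior network and the plain MSE consistency term are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotPrior(nn.Module):
    """Autoregressive prior: predict this frame's slots from the previous frame's slots."""
    def __init__(self, slot_dim=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(slot_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(slot_dim, slot_dim)

    def forward(self, prev_slots):                      # (B, K, D)
        h, _ = self.attn(prev_slots, prev_slots, prev_slots)
        return self.proj(h)

def consistency_loss(pred_slots, curr_slots):
    """Encourage slot k at time t to describe the same object as slot k at t-1."""
    return F.mse_loss(pred_slots, curr_slots)

prior = SlotPrior()
prev_slots = torch.randn(2, 7, 64)   # slots from frame t-1
curr_slots = torch.randn(2, 7, 64)   # slots from frame t (e.g. slot-attention output)
loss = consistency_loss(prior(prev_slots), curr_slots)
```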
World models integrate raw data from various modalities, such as images and language, to simulate comprehensive interactions in the world, thereby playing crucial roles in fields like mixed reality and robotics. Yet, applying world models to accurate video prediction is quite challenging due to the complex and dynamic intentions of the various scenes in practice. In this paper, inspired by the human rethinking process, we decompose complex video prediction into four meta-tasks that enable the world model to handle this issue in a more fine-grained manner. Alongside these tasks, we introduce a new benchmark named the Embodied Video Anticipation Benchmark (EVA-Bench) to provide a well-rounded evaluation. EVA-Bench focuses on evaluating the video prediction ability of human and robot actions, presenting significant challenges for both the language model and the generation model. Targeting embodied video prediction, we propose the Embodied Video Anticipator (EVA), a unified framework aimed at video understanding and generation. EVA integrates a video generation model with a visual language model, effectively combining reasoning capabilities with high-quality generation. Moreover, to enhance the generalization of our framework, we tailor a multi-stage pretraining paradigm that adaptively ensembles LoRA to produce high-fidelity results. Extensive experiments on EVA-Bench highlight the potential of EVA to significantly improve performance in embodied scenes, paving the way for large-scale pre-trained models in real-world prediction tasks.
https://arxiv.org/abs/2410.15461
The increase in Arctic marine activity due to rapid warming and significant sea ice loss necessitates highly reliable, short-term sea ice forecasts to ensure maritime safety and operational efficiency. In this work, we present a novel data-driven approach for sea ice condition forecasting in the Gulf of Ob, leveraging sequences of radar images from Sentinel-1, weather observations, and GLORYS forecasts. Our approach integrates advanced video prediction models, originally developed for vision tasks, with domain-specific data preprocessing and augmentation techniques tailored to the unique challenges of Arctic sea ice dynamics. Central to our methodology is the use of uncertainty quantification to assess the reliability of predictions, ensuring robust decision-making in safety-critical applications. Furthermore, we propose a confidence-based model mixture mechanism that enhances forecast accuracy and model robustness, crucial for reliable operations in volatile Arctic environments. Our results demonstrate substantial improvements over baseline approaches, underscoring the importance of uncertainty quantification and specialized data handling for effective and safe operations and reliable forecasting.
https://arxiv.org/abs/2410.19782
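A minimal sketch of a confidence-based mixture: each member forecast is weighted by the inverse of its estimated uncertainty. The inverse-uncertainty weighting rule and array shapes are assumptions, not necessarily the paper's exact mechanism.

```python
import numpy as np

def confidence_mixture(forecasts, uncertainties, eps=1e-6):
    """Combine per-model forecasts with weights proportional to inverse uncertainty.

    forecasts, uncertainties: arrays of shape (n_models, H, W) for one lead time.
    """
    weights = 1.0 / (np.asarray(uncertainties) + eps)
    weights = weights / weights.sum(axis=0, keepdims=True)   # normalize across models
    return (weights * np.asarray(forecasts)).sum(axis=0)

# Toy example: two models predicting sea ice concentration on a 4x4 grid.
f = np.stack([np.full((4, 4), 0.6), np.full((4, 4), 0.8)])
u = np.stack([np.full((4, 4), 0.05), np.full((4, 4), 0.20)])  # model 1 is more confident
blended = confidence_mixture(f, u)   # closer to 0.6 than to 0.8
```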
Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self-supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and introduce GIT-STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling our model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer-based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.
https://arxiv.org/abs/2410.07836
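A minimal sketch of a state mixer that fuses a latent state with a continuous action before it enters the sequence model; the MLP form and dimensions are assumptions rather than GIT-STORM's exact design.

```python
import torch
import torch.nn as nn

class StateMixer(nn.Module):
    """Fuse a latent state with a continuous action before the sequence model."""
    def __init__(self, state_dim=256, action_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

mixer = StateMixer()
mixed = mixer(torch.randn(8, 256), torch.randn(8, 6))   # (8, 256), ready for the world model
```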
Advanced facial recognition technologies and recommender systems, combined with inadequate privacy technologies and policies for facial interactions, raise concerns about bioprivacy violations. With the proliferation of video and live-streaming websites, the public distribution of face videos and the interactions around them pose greater privacy risks. Existing techniques typically address the risk of sensitive biometric information leakage through various privacy enhancement methods, but they pose a higher security risk either by corrupting the information to be conveyed by the interaction data or by leaving certain biometric features intact, allowing an attacker to infer sensitive biometric information from them. To address these shortcomings, we propose a neural network framework, CausalVE. We obtain cover images by adopting a diffusion model to achieve face swapping with face guidance, and we use the speech sequence features and spatiotemporal sequence features of the secret video for dynamic video inference and prediction to obtain a cover video with the same number of frames as the secret video. In addition, we hide the secret video within the cover video using reversible neural networks, so that the cover video can also disseminate the secret data. Extensive experiments show that CausalVE provides good security for public video dissemination and outperforms state-of-the-art methods from qualitative, quantitative, and visual points of view.
https://arxiv.org/abs/2409.19306
How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution by predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how to leverage video generation models trained on readily available web data to enable generalization. Our approach, Gen2Act, casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data than the video prediction model was trained on. Gen2Act does not require fine-tuning the video model at all; we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at this https URL
https://arxiv.org/abs/2409.16283
Although Model Predictive Control (MPC) can effectively predict the future states of a system and is thus widely used in robotic manipulation tasks, it lacks the capability of environmental perception, leading to failures in some complex scenarios. To address this issue, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation framework that takes advantage of the powerful perception capability of vision-language models (VLMs) and integrates it with MPC. Specifically, we propose a conditional action sampling module that takes as input a goal image or a language instruction and leverages the VLM to sample a set of candidate action sequences. Then, a lightweight action-conditioned video prediction model is designed to generate a set of future frames conditioned on the candidate action sequences. VLMPC produces the optimal action sequence with the assistance of the VLM through a hierarchical cost function that formulates both pixel-level and knowledge-level consistency between the current observation and the goal image. We demonstrate that VLMPC outperforms state-of-the-art methods on public benchmarks. More importantly, our method showcases excellent performance in various real-world robotic manipulation tasks. Code is available at this https URL.
https://arxiv.org/abs/2407.09829
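A minimal sketch of a hierarchical cost that combines a pixel-level term with a knowledge-level term scored by a VLM; `vlm_score_fn` is a placeholder, and the weighting and selection logic are assumptions rather than VLMPC's exact formulation.

```python
import torch
import torch.nn.functional as F

def pixel_cost(pred_frames, goal_image):
    """Pixel-level consistency: distance between each predicted frame and the goal."""
    return F.mse_loss(pred_frames, goal_image.expand_as(pred_frames))

def hierarchical_cost(pred_frames, goal_image, vlm_score_fn, w_pixel=1.0, w_knowledge=1.0):
    """Combine pixel-level and knowledge-level terms.

    `vlm_score_fn` is a placeholder for a VLM that rates how semantically close
    the predicted rollout is to the goal (higher = closer); it is not a real API.
    """
    return w_pixel * pixel_cost(pred_frames, goal_image) \
         - w_knowledge * vlm_score_fn(pred_frames, goal_image)

def select_best(candidate_rollouts, goal_image, vlm_score_fn):
    """Rank candidate action sequences by the cost of their predicted rollouts."""
    costs = [hierarchical_cost(r, goal_image, vlm_score_fn) for r in candidate_rollouts]
    return int(torch.argmin(torch.stack(costs)))

rollouts = [torch.rand(8, 3, 64, 64) for _ in range(4)]   # frames from the video predictor
goal = torch.rand(3, 64, 64)
best = select_best(rollouts, goal, lambda frames, g: torch.tensor(0.0))
```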
We propose a general way to integrate procedural knowledge of a domain into deep learning models. We apply it to video prediction, building on top of object-centric deep models, and show that this leads to better performance than using data-driven models alone. We develop an architecture that facilitates latent-space disentanglement in order to use the integrated procedural knowledge, and we establish a setup that allows the model to learn the procedural interface in the latent space through the downstream task of video prediction. We contrast the performance with a state-of-the-art data-driven approach and show that problems where purely data-driven approaches struggle can be handled by using knowledge about the domain, providing an alternative to simply collecting more data.
https://arxiv.org/abs/2406.18220
We propose a novel architecture design for video prediction that utilizes procedural domain knowledge directly as part of the computational graph of data-driven models. On the basis of new, challenging scenarios, we show that state-of-the-art video predictors struggle in complex dynamical settings, and we highlight that the introduction of prior process knowledge makes their learning problem feasible. Our approach results in the learning of a symbolically addressable interface between the data-driven aspects of the model and our dedicated procedural knowledge module, which we utilize in downstream control tasks.
https://arxiv.org/abs/2407.09537
We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy models (VOCs). VOCs operate in a compact latent space, thus avoiding the need to make predictions about individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, thus avoiding the need for multistep roll-outs. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at this https URL.
https://arxiv.org/abs/2407.09533
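The "discounted distribution of future states" is commonly formalized as a normalized successor measure over latent states; a sketch of that definition under standard notation, which may differ from the paper's exact formulation:

```latex
\[
\mu^{\pi}_{\gamma}\!\left(z' \mid z_t\right)
  \;=\; (1-\gamma)\sum_{k=1}^{\infty} \gamma^{\,k-1}\,
        \Pr\!\left(z_{t+k} = z' \mid z_t,\, \pi\right)
\]
```

A VOC would then be trained to produce samples from (or parameters of) this distribution in one forward pass, instead of rolling the latent dynamics forward step by step.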
Deep generative models have demonstrated the ability to create realistic audiovisual content, sometimes driven by data from domains of a different nature. However, achieving smooth temporal dynamics in video generation remains a challenging problem. This work focuses on generic sound-to-video generation and proposes three main features to enhance both image quality and temporal coherency in generative adversarial models: a triple sound-routing scheme, a multi-scale residual and dilated recurrent network for extended sound analysis, and a novel recurrent and directional convolutional layer for video prediction. Each of the proposed features improves the quality and coherency of the baseline neural architecture typically used in the SoTA, with the video prediction layer providing an extra temporal refinement.
https://arxiv.org/abs/2406.16155
Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown remarkable computational benefits in long-range temporal dependency modeling. However, in many sequence modeling problems, the underlying process is inherently modular and it is of interest to have inductive biases that mimic this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into SSMs to preserve or encourage separation of information. Unlike conventional SSMs that maintain a monolithic state vector, SlotSSMs maintains the state as a collection of multiple vectors called slots. Crucially, the state transitions are performed independently per slot with sparse interactions across slots implemented via the bottleneck of self-attention. In experiments, we evaluate our model in object-centric video understanding, 3D visual reasoning, and video prediction tasks, which involve modeling multiple objects and their long-range temporal dependencies. We find that our proposed design offers substantial performance gains over existing sequence modeling methods.
https://arxiv.org/abs/2406.12272
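A minimal sketch of a slot-structured recurrent block: each slot's state is updated independently, and slots interact only through a self-attention bottleneck. A shared GRUCell stands in for the state-space transition, so this illustrates the modular structure rather than the actual S4/S5/Mamba-style recurrence.

```python
import torch
import torch.nn as nn

class SlotBlock(nn.Module):
    """One step of a slot-structured sequence model (simplified stand-in for SlotSSMs)."""
    def __init__(self, slot_dim=64, n_heads=4):
        super().__init__()
        self.transition = nn.GRUCell(slot_dim, slot_dim)          # per-slot, weights shared
        self.cross_slot = nn.MultiheadAttention(slot_dim, n_heads, batch_first=True)

    def forward(self, inputs, slots):
        # inputs, slots: (B, K, D) -- K independent slot states.
        B, K, D = slots.shape
        # Independent transitions: fold the slot axis into the batch axis.
        new_slots = self.transition(inputs.reshape(B * K, D), slots.reshape(B * K, D))
        new_slots = new_slots.reshape(B, K, D)
        # Sparse interaction across slots via the self-attention bottleneck.
        mixed, _ = self.cross_slot(new_slots, new_slots, new_slots)
        return new_slots + mixed

block = SlotBlock()
slots = torch.zeros(2, 6, 64)                      # 6 slots
for t in range(10):                                # roll over a toy 10-step input sequence
    slots = block(torch.randn(2, 6, 64), slots)
```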
Financial assets exhibit complex dependency structures, which are crucial for investors seeking to create diversified portfolios that mitigate risk in volatile financial markets. To explore the dynamics of financial asset dependencies, we propose a novel approach that models the dependencies of assets as an Asset Dependency Matrix (ADM) and treats ADM sequences as image sequences. This allows us to leverage deep learning-based video prediction methods to capture the spatiotemporal dependencies among assets. However, unlike images, where neighboring pixels exhibit explicit spatiotemporal dependencies due to the natural continuity of object movements, assets in an ADM have no natural order. This poses challenges for organizing the related assets so as to better reveal the spatiotemporal dependencies among neighboring assets for ADM forecasting. To tackle these challenges, we propose the Asset Dependency Neural Network (ADNN), which employs the Convolutional Long Short-Term Memory (ConvLSTM) network, a highly successful method for video prediction. ADNN can employ static and dynamic transformation functions to optimize the representations of the ADM. Through extensive experiments, we demonstrate that our proposed framework consistently outperforms the baselines on the ADM prediction and downstream application tasks. This research contributes to understanding and predicting asset dependencies, offering valuable insights for financial market participants.
https://arxiv.org/abs/2406.11886
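A minimal sketch of building an ADM sequence to feed a ConvLSTM-style video predictor, assuming the ADM is a rolling correlation matrix of asset returns; the paper's exact ADM definition and the window length here are assumptions.

```python
import numpy as np
import pandas as pd

def adm_sequence(returns: pd.DataFrame, window: int = 60):
    """Build a sequence of Asset Dependency Matrices from asset returns.

    Each ADM is the rolling correlation matrix over `window` observations,
    treated downstream as one 'frame' of an image sequence.
    """
    adms = []
    for end in range(window, len(returns) + 1):
        corr = returns.iloc[end - window:end].corr().to_numpy()   # (n_assets, n_assets)
        adms.append(corr)
    return np.stack(adms)                                         # (T, n_assets, n_assets)

# Toy example with 5 assets and 250 days of synthetic returns.
rng = np.random.default_rng(0)
rets = pd.DataFrame(rng.normal(size=(250, 5)), columns=list("ABCDE"))
frames = adm_sequence(rets)          # feed as an image sequence to a ConvLSTM predictor
print(frames.shape)                  # (191, 5, 5)
```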