Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating the need for extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms, such as KL regularization and policy projection, emerge as specific choices within a unified framework. We then use the derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but observe that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
https://arxiv.org/abs/2412.02617
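For context on the kind of objective this unifies, a standard KL-regularized reward-maximization formulation (illustrative only; the paper derives its own unified objective, which may differ in detail) is

$$\max_{\theta}\ \mathbb{E}_{x \sim \pi_\theta}\big[r(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad \pi^{*}(x) \;\propto\; \pi_{\mathrm{ref}}(x)\,\exp\big(r(x)/\beta\big),$$

where $r$ is the feedback-derived reward, $\pi_{\mathrm{ref}}$ is the pretrained text-to-video model, and $\beta$ sets the strength of the KL regularizer. The closed-form optimum on the right is why many offline algorithms reduce to reweighting or projecting samples drawn from the reference model.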
Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information such as optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic areas without additional information. A difference mask derived from surface normals identifies static and dynamic areas by measuring directional variance. For static areas, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic areas, the Surface Normal Similarity (SNS) module aligns regions and enhances temporal consistency by measuring feature similarity between frames. A final refinement stage integrates the independently learned static and dynamic areas, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.
https://arxiv.org/abs/2412.01090
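As a rough illustration of how a difference mask over surface normals can separate static from dynamic areas (a simplified sketch, not STATIC's exact formulation; the per-pixel angle threshold and the toy data are assumptions):

```python
import numpy as np

def normal_difference_mask(normals_t, normals_t1, thresh_deg=10.0):
    """Flag pixels whose surface-normal direction changes between frames.

    normals_t, normals_t1: (H, W, 3) unit surface normals for two consecutive frames.
    Returns a boolean mask that is True for (presumably) dynamic pixels.
    """
    # Cosine of the angle between corresponding normals.
    cos = np.clip((normals_t * normals_t1).sum(axis=-1), -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos))
    # Large directional change -> dynamic; small change -> static.
    return angle_deg > thresh_deg

# Toy usage with random unit normals.
n0 = np.random.randn(64, 64, 3)
n0 /= np.linalg.norm(n0, axis=-1, keepdims=True)
n1 = n0.copy()
n1[16:32, 16:32] = -n1[16:32, 16:32]   # simulate a region whose geometry moved
dynamic_area = normal_difference_mask(n0, n1)
static_area = ~dynamic_area
```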
Recently, diffusion-based methods have achieved great improvements in the video inpainting task. However, these methods still face many challenges, such as maintaining temporal consistency and coping with long inference times. This paper proposes an advanced video inpainting framework using optical Flow-guided Efficient Diffusion, called FloED. Specifically, FloED employs a dual-branch architecture, where a flow branch first restores corrupted flow and a multi-scale flow adapter provides motion guidance to the main inpainting branch. Additionally, a training-free latent interpolation method is proposed to accelerate the multi-step denoising process using flow warping. By further introducing a flow attention cache mechanism, FloED efficiently reduces the computational cost brought by incorporating optical flow. Comprehensive experiments on both background restoration and object removal tasks demonstrate that FloED outperforms state-of-the-art methods in terms of both performance and efficiency.
https://arxiv.org/abs/2412.00857
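The flow-warping operation underlying such latent interpolation can be sketched with a standard backward warp via `grid_sample` (a generic sketch assuming pixel-unit flow; FloED's actual interpolation schedule and adapter are not reproduced here):

```python
import torch
import torch.nn.functional as F

def flow_warp(latent, flow):
    """Backward-warp a latent feature map by a dense flow field.

    latent: (B, C, H, W) features; flow: (B, 2, H, W) displacements in pixels (dx, dy).
    """
    B, _, H, W = latent.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xx, yy), dim=-1).float().to(latent.device)      # (H, W, 2)
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)                 # sample positions
    # Normalize to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid_y = 2.0 * grid[..., 1] / (H - 1) - 1.0
    norm_grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(latent, norm_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

latent_t = torch.randn(1, 4, 32, 32)
flow_t = torch.zeros(1, 2, 32, 32)           # zero flow -> identity warp
assert torch.allclose(flow_warp(latent_t, flow_t), latent_t, atol=1e-4)
```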
This study explores the application of deep learning for rainfall prediction, leveraging the Spinning Enhanced Visible and Infrared Imager (SEVIRI) High Rate Information Transmission (HRIT) data as input and the Operational Program on the Exchange of weather RAdar information (OPERA) ground-radar reflectivity data as ground truth. We use the mean of four infrared frequency channels as the input. The radiance images are forecast up to 4 hours into the future using a dense optical flow algorithm. A conditional generative adversarial network (GAN) model is then employed to transform the predicted radiance images into rainfall images, which are aggregated over the 4-hour forecast period to produce cumulative rainfall values. This model scored approximately 7.5 on the Continuous Ranked Probability Score (CRPS) in the Weather4Cast 2024 competition and placed 1st on the core challenge leaderboard.
https://arxiv.org/abs/2412.00451
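The optical-flow extrapolation step can be approximated with OpenCV primitives (a rough sketch of the idea only; the competition pipeline, the four-channel averaging, and the conditional GAN mapping to rainfall are not shown):

```python
import cv2
import numpy as np

def extrapolate(prev_img, next_img, steps=4):
    """Advect the latest radiance image forward `steps` times using dense optical flow."""
    flow = cv2.calcOpticalFlowFarneback(prev_img, next_img, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2)
    h, w = next_img.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Backward-warp approximation of forward advection: a pixel in the next
    # frame is fetched from x - flow(x) in the current one.
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    frames, cur = [], next_img
    for _ in range(steps):
        cur = cv2.remap(cur, map_x, map_y, cv2.INTER_LINEAR,
                        borderMode=cv2.BORDER_REPLICATE)
        frames.append(cur)
    return frames

prev_frame = (np.random.rand(128, 128) * 255).astype(np.uint8)
next_frame = np.roll(prev_frame, 2, axis=1)          # simple horizontal shift
future = extrapolate(prev_frame, next_frame, steps=4)
```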
In neural video codecs, current state-of-the-art methods typically adopt multi-scale motion compensation to handle diverse motions. These methods estimate and compress either optical flow or deformable offsets to reduce inter-frame redundancy. However, flow-based methods often suffer from inaccurate motion estimation in complicated scenes, while deformable convolution-based methods are more robust but incur a higher bit cost for motion coding. In this paper, we propose a hybrid context generation module that combines the advantages of the above methods and achieves accurate compensation at a low bit cost. Specifically, considering the characteristics of features at different scales, we adopt flow-guided deformable compensation at the largest scale to produce accurate alignment in detailed regions. For smaller-scale features, we perform flow-based warping to save the bit cost of motion coding. Furthermore, we design a local-global context enhancement module to fully exploit the local and global information in previously reconstructed signals. Experimental results demonstrate that our proposed Hybrid Local-Global Context learning (HLGC) method significantly enhances state-of-the-art methods on standard test datasets.
https://arxiv.org/abs/2412.00446
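A schematic of the hybrid idea, flow-guided deformable compensation at the largest scale and plain flow warping at smaller scales, might look as follows (a sketch with assumed channel sizes and offset handling, not the HLGC implementation; the offset-channel ordering convention should be verified against your torchvision version):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

def flow_warp(feat, flow):
    """Backward-warp features by a per-pixel flow field (dx, dy in pixels)."""
    B, _, H, W = feat.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xx, yy), -1).float().to(feat.device).unsqueeze(0) + flow.permute(0, 2, 3, 1)
    grid = torch.stack((2 * grid[..., 0] / (W - 1) - 1, 2 * grid[..., 1] / (H - 1) - 1), -1)
    return F.grid_sample(feat, grid, align_corners=True)

class HybridCompensation(nn.Module):
    """Largest scale: flow-guided deformable alignment; smaller scale: plain flow warping."""
    def __init__(self, channels):
        super().__init__()
        # 3x3 kernel -> 9 sampling points -> 18 offset channels.
        self.offset_head = nn.Conv2d(channels + 2, 18, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, ref_large, flow_large, ref_small, flow_small):
        # Residual offsets are predicted around the flow, then tiled per kernel point.
        # Assumed here: offsets are interleaved (dy, dx) pairs, hence the channel swap.
        base = flow_large[:, [1, 0]].repeat(1, 9, 1, 1)
        offsets = base + self.offset_head(torch.cat([ref_large, flow_large], dim=1))
        ctx_large = self.deform(ref_large, offsets)          # accurate alignment, detailed regions
        ctx_small = flow_warp(ref_small, flow_small)         # cheaper: no extra offsets to code
        return ctx_large, ctx_small

m = HybridCompensation(32)
out_l, out_s = m(torch.randn(1, 32, 64, 64), torch.zeros(1, 2, 64, 64),
                 torch.randn(1, 32, 32, 32), torch.zeros(1, 2, 32, 32))
```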
This paper proposes an enhancement to the ORB-SLAM3 algorithm, tailored for applications on rugged road surfaces. Our improved algorithm adeptly combines feature point matching with optical flow methods, capitalizing on the high robustness of optical flow in complex terrains and the high precision of feature points on smooth surfaces. By refining the inter-frame matching logic of ORB-SLAM3, we have addressed the issue of frame matching loss on uneven roads. To prevent a decrease in accuracy, an adaptive matching mechanism has been incorporated, which increases the reliance on optical flow points during periods of high vibration, thereby effectively maintaining SLAM precision. Furthermore, due to the scarcity of multi-sensor datasets suitable for environments with bumpy roads or speed bumps, we have collected LiDAR and camera data from such settings. Our enhanced algorithm, ORB-SLAM3AB, was then benchmarked against several advanced open-source SLAM algorithms that rely solely on laser or visual data. Through the analysis of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics, our results demonstrate that ORB-SLAM3AB achieves superior robustness and accuracy on rugged road surfaces.
https://arxiv.org/abs/2411.18174
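A toy illustration of the adaptive matching idea (not ORB-SLAM3AB's actual logic; the vibration metric, threshold, and proportional blending rule are assumptions made for the sketch):

```python
import cv2
import numpy as np

def track_points(prev_gray, cur_gray, vibration, vib_thresh=2.0):
    """Blend ORB feature matching with LK optical flow depending on vibration level.

    vibration: e.g., standard deviation of recent IMU vertical acceleration.
    Returns matched point pairs (prev_pts, cur_pts) and the flow weight used.
    """
    # Weight shifts toward optical flow when vibration is high.
    w_flow = float(np.clip(vibration / vib_thresh, 0.0, 1.0))

    # Optical-flow track: Shi-Tomasi corners + pyramidal Lucas-Kanade.
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300,
                                      qualityLevel=0.01, minDistance=7)
    flow_pairs = (np.empty((0, 2)), np.empty((0, 2)))
    if corners is not None:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, corners, None)
        ok = status.ravel() == 1
        flow_pairs = (corners.reshape(-1, 2)[ok], nxt.reshape(-1, 2)[ok])

    # ORB descriptor matching (the "precise on smooth surfaces" branch).
    orb = cv2.ORB_create(500)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(cur_gray, None)
    orb_pairs = (np.empty((0, 2)), np.empty((0, 2)))
    if des1 is not None and des2 is not None:
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
        if matches:
            orb_pairs = (np.float32([kp1[m.queryIdx].pt for m in matches]),
                         np.float32([kp2[m.trainIdx].pt for m in matches]))

    # Sub-sample each source in proportion to its weight (simplistic blending).
    n_flow = int(w_flow * len(flow_pairs[0]))
    n_orb = int((1.0 - w_flow) * len(orb_pairs[0]))
    prev_pts = np.concatenate([flow_pairs[0][:n_flow], orb_pairs[0][:n_orb]])
    cur_pts = np.concatenate([flow_pairs[1][:n_flow], orb_pairs[1][:n_orb]])
    return prev_pts, cur_pts, w_flow
```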
With the rapid advancements in deep learning, computer vision tasks have seen significant improvements, making two-stream neural networks a popular focus for video-based action recognition. Traditional models using RGB and optical flow streams achieve strong performance but at a high computational cost. To address this, we introduce a representation flow algorithm to replace the optical flow branch in the egocentric action recognition model, enabling end-to-end training while reducing computational cost and prediction time. Our model, designed for egocentric action recognition, uses class activation maps (CAMs) to improve accuracy and ConvLSTM for spatio-temporal encoding with spatial attention. When evaluated on the GTEA61, EGTEA GAZE+, and HMDB datasets, our model matches the accuracy of the original model on GTEA61 and exceeds it by 0.65% and 0.84% on EGTEA GAZE+ and HMDB, respectively. Prediction runtimes are significantly reduced to 0.1881s, 0.1503s, and 0.1459s, compared to the original model's 101.6795s, 25.3799s, and 203.9958s. Ablation studies were also conducted to examine the impact of different parameters on model performance. Keywords: two-stream, egocentric, action recognition, CAM, representation flow, ConvLSTM
https://arxiv.org/abs/2411.18002
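The CAM-based spatial attention can be sketched with the standard CAM recipe (generic, not the paper's exact architecture; the feature shapes and the 61-class head below are assumptions):

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Compute a class activation map from the last conv features.

    features:  (B, C, H, W) activations of the final convolutional layer.
    fc_weight: (num_classes, C) weights of the global-average-pool classifier.
    Returns a (B, 1, H, W) map normalized to [0, 1], usable as spatial attention.
    """
    w = fc_weight[class_idx].view(1, -1, 1, 1)              # (1, C, 1, 1)
    cam = (features * w).sum(dim=1, keepdim=True)           # weighted sum over channels
    cam = F.relu(cam)
    cam = cam - cam.amin(dim=(2, 3), keepdim=True)
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-6)
    return cam

feats = torch.randn(2, 512, 7, 7)
fc_w = torch.randn(61, 512)               # e.g., 61 classes as in GTEA61
attn = class_activation_map(feats, fc_w, class_idx=3)
attended = feats * attn                   # apply as spatial attention
```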
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
https://arxiv.org/abs/2411.18650
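The combination of flow and epipolar cues can be illustrated as follows (a simplified sketch, not RoMo itself: correspondences, e.g., sampled from dense optical flow, are tested against a RANSAC fundamental matrix, and points with large Sampson error are flagged as dynamic; the threshold is an assumption):

```python
import cv2
import numpy as np

def dynamic_point_mask(pts1, pts2, err_thresh=2.0):
    """Flag correspondences that violate the dominant epipolar geometry.

    pts1, pts2: (N, 2) pixel coordinates in two frames; pts2 would typically be
    pts1 plus the optical flow sampled at pts1.
    Returns a boolean array, True where a point is likely dynamic.
    """
    F_mat, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    if F_mat is None:
        return np.zeros(len(pts1), dtype=bool)

    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])          # homogeneous coordinates (N, 3)
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F_mat.T                    # epipolar lines in image 2
    Ftx2 = x2 @ F_mat                     # epipolar lines in image 1
    num = np.square(np.sum(x2 * Fx1, axis=1))
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    sampson = num / (den + 1e-12)         # squared Sampson distance, in px^2
    return sampson > err_thresh ** 2
```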
We present Buffer Anytime, a framework for estimating depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video-depth and video-normal training data. Instead of relying on large-scale annotated video datasets, we demonstrate high-quality video buffer estimation by leveraging single-image priors together with temporal consistency constraints. Our zero-shot training strategy combines state-of-the-art image estimation models with an optical-flow-based smoothness term through a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to leading image models such as Depth Anything V2 and Marigold-E2E-FT, our approach significantly improves temporal consistency while maintaining accuracy. Experiments show that our method not only outperforms image-based approaches but also achieves results comparable to state-of-the-art video models trained on large-scale paired video datasets, despite using no such paired video data.
https://arxiv.org/abs/2411.17249
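An optical-flow smoothness term of the kind described can be written as a warped-consistency loss (a generic formulation assuming backward flow from frame t+1 to t; Buffer Anytime's hybrid loss and temporal attention module are not reproduced):

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(pred_t, pred_t1, flow_t1_to_t, valid=None):
    """Penalize disagreement between frame t+1's prediction and frame t's prediction
    warped into frame t+1 using backward optical flow.

    pred_t, pred_t1: (B, C, H, W) per-frame geometric buffers (depth or normals).
    flow_t1_to_t:    (B, 2, H, W) flow mapping pixels of frame t+1 back to frame t.
    valid:           optional (B, 1, H, W) occlusion/validity mask.
    """
    B, _, H, W = pred_t.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xx, yy), -1).float().to(pred_t.device)
    grid = base.unsqueeze(0) + flow_t1_to_t.permute(0, 2, 3, 1)
    grid = torch.stack((2 * grid[..., 0] / (W - 1) - 1,
                        2 * grid[..., 1] / (H - 1) - 1), -1)
    warped_t = F.grid_sample(pred_t, grid, align_corners=True)
    diff = (warped_t - pred_t1).abs()
    if valid is not None:
        diff = diff * valid
    return diff.mean()
```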
Traditional neural network-driven inpainting methods struggle to deliver high-quality results within the constraints of mobile device processing power and memory. Our research introduces an innovative approach to optimizing memory usage by altering the composition of the input data. Typically, video inpainting relies on a predetermined set of input frames, such as neighboring and reference frames, often limited to five-frame sets. Our focus is to examine how varying the proportion of these input frames affects the quality of the inpainted video. By dynamically adjusting the input frame composition based on optical flow and changes in the mask, we observe improvements across various types of content, including scenes with rapid visual context changes.
https://arxiv.org/abs/2411.16926
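A minimal heuristic in the spirit of this idea (illustrative only; the frame budget, thresholds, and blending rule are assumptions, not the paper's policy):

```python
import numpy as np

def choose_frame_composition(flow_mag, mask_prev, mask_cur, budget=5):
    """Split a fixed frame budget between neighboring and reference frames.

    flow_mag:  mean optical-flow magnitude between the last two frames (pixels).
    mask_prev, mask_cur: boolean inpainting masks for those frames.
    Fast motion or a rapidly changing mask favors neighboring frames; static
    content favors distant reference frames.
    """
    inter = np.logical_and(mask_prev, mask_cur).sum()
    union = np.logical_or(mask_prev, mask_cur).sum()
    mask_change = 1.0 - inter / max(union, 1)               # 0 = static mask, 1 = new mask
    dynamism = min(1.0, flow_mag / 10.0) * 0.5 + mask_change * 0.5
    n_neighbors = int(round(1 + dynamism * (budget - 2)))   # at least 1 of each kind
    n_references = budget - n_neighbors
    return n_neighbors, n_references

# e.g., a fast camera pan with a moving mask -> mostly neighboring frames
print(choose_frame_composition(flow_mag=8.0, mask_prev=np.zeros((64, 64), bool),
                               mask_cur=np.ones((64, 64), bool)))
```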
Simultaneous localization and mapping (SLAM) has achieved impressive performance in static environments. However, SLAM in dynamic environments remains an open problem. Many methods simply filter out dynamic objects, resulting in incomplete scene reconstruction and limited camera localization accuracy. Other works represent dynamic objects with point clouds, sparse joints, or coarse meshes, which fail to provide a photo-realistic representation. To overcome the above limitations, we propose a photo-realistic and geometry-aware RGB-D SLAM method that extends Gaussian splatting. Our method is composed of three main modules that 1) map the dynamic foreground, including non-rigid humans and rigid items, 2) reconstruct the static background, and 3) localize the camera. To map the foreground, we focus on modeling the deformations and/or motions; we consider human shape priors and exploit geometric and appearance constraints on humans and items. For background mapping, we design an optimization strategy between neighboring local maps that integrates an appearance constraint into the geometric alignment. For camera localization, we leverage both the static background and the dynamic foreground to increase the number of observations for noise compensation. We explore geometric and appearance constraints by associating 3D Gaussians with 2D optical flows and pixel patches. Experiments on various real-world datasets demonstrate that our method outperforms state-of-the-art approaches in terms of camera localization and scene representation. Source code will be made publicly available upon paper acceptance.
https://arxiv.org/abs/2411.15800
While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces the additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish the optical flow of random frame pairs from real videos and from generated ones. Since prompts can influence the entire video, during reverse sampling we optimize learnable token embeddings using gradients from the trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.
https://arxiv.org/abs/2411.15540
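A highly simplified sketch of discriminator-guided prompt optimization (the frame generator and flow estimator below are placeholders, and the token shape and loss are assumptions; MotionPrompt's actual reverse-sampling integration is not shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowDiscriminator(nn.Module):
    """Tiny discriminator over a 2-channel flow field (stand-in architecture)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, flow):
        return self.net(flow)

def generate_frame_pair(tokens):
    """Placeholder for one reverse-sampling step of the video diffusion model."""
    return (torch.tanh(tokens.mean()) + torch.randn(1, 3, 64, 64),
            torch.tanh(tokens.mean()) + torch.randn(1, 3, 64, 64))

def estimate_flow(f1, f2):
    """Placeholder for a differentiable optical-flow estimate between two frames."""
    return (f2 - f1)[:, :2]                     # not real flow; just keeps the graph differentiable

# Learnable token embeddings receive gradients that push the flow of generated
# frame pairs toward the discriminator's "real" decision.
disc = FlowDiscriminator()                      # assumed pre-trained on real vs. generated flow
tokens = nn.Parameter(torch.randn(1, 8, 768))   # learnable token embeddings (shape assumed)
opt = torch.optim.Adam([tokens], lr=1e-3)

f1, f2 = generate_frame_pair(tokens)
score = disc(estimate_flow(f1, f2))
loss = F.softplus(-score).mean()                # non-saturating "looks real" objective
opt.zero_grad()
loss.backward()
opt.step()
```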
Optical flow estimation is extensively used in autonomous driving and video editing. While existing models demonstrate state-of-the-art performance across various benchmarks, the robustness of these methods has rarely been investigated. Although some research has examined the robustness of optical flow models against adversarial attacks, studies of their robustness to common corruptions are lacking. Taking into account the unique temporal characteristics of optical flow, we introduce 7 temporal corruptions specifically designed for benchmarking the robustness of optical flow models, in addition to 17 classical single-image corruptions, for which an advanced PSF blur simulation method is employed. Two robustness benchmarks, KITTI-FC and GoPro-FC, are subsequently established as the first corruption robustness benchmarks for optical flow estimation, with Out-Of-Domain (OOD) and In-Domain (ID) settings to facilitate comprehensive studies. Three robustness metrics are further introduced to quantify optical flow estimation robustness: Corruption Robustness Error (CRE), Corruption Robustness Error ratio (CREr), and Relative Corruption Robustness Error (RCRE). 29 model variants from 15 optical flow methods are evaluated, yielding 10 intriguing observations, for example: 1) the absolute robustness of a model is heavily dependent on its estimation performance; 2) corruptions that diminish local information are more harmful than those that merely degrade visual appearance. We also give suggestions for the design and application of optical flow models. We anticipate that our benchmark will serve as a foundational resource for advancing research in robust optical flow estimation. The benchmarks and source code will be released at this https URL.
https://arxiv.org/abs/2411.14865
Optical flow estimation is a critical task for tiny mobile robots to enable safe and accurate navigation, obstacle avoidance, and other functionalities. However, optical flow estimation on tiny robots is challenging due to limited onboard sensing and computation capabilities. In this paper, we propose EdgeFlowNet, a high-speed, low-latency dense optical flow approach for tiny autonomous mobile robots that harnesses the power of edge computing. We demonstrate the efficacy of our approach by deploying EdgeFlowNet on a tiny quadrotor to perform static obstacle avoidance, flight through unknown gaps, and dynamic obstacle dodging. EdgeFlowNet is about 20x faster than previous state-of-the-art approaches while improving accuracy by over 20%, and it uses only 1.08 W of power, enabling advanced autonomy on palm-sized tiny mobile robots.
https://arxiv.org/abs/2411.14576
Realistic simulation of dynamic scenes requires accurately capturing diverse material properties and modeling complex object interactions grounded in physical principles. However, existing methods are constrained to basic material types with limited predictable parameters, making them insufficient to represent the complexity of real-world materials. We introduce a novel approach that leverages multi-modal foundation models and video diffusion to achieve enhanced 4D dynamic scene simulation. Our method utilizes multi-modal models to identify material types and initialize material parameters through image queries, while simultaneously inferring 3D Gaussian splats for detailed scene representation. We further refine these material parameters using video diffusion with a differentiable Material Point Method (MPM) and optical flow guidance rather than render loss or Score Distillation Sampling (SDS) loss. This integrated framework enables accurate prediction and realistic simulation of dynamic interactions in real-world scenarios, advancing both accuracy and flexibility in physics-based simulations.
https://arxiv.org/abs/2411.14423
In many video processing tasks, leveraging large-scale image datasets is a common strategy, as image data is more abundant and facilitates comprehensive knowledge transfer. A typical approach for simulating video from static images involves applying spatial transformations, such as affine transformations and spline warping, to create sequences that mimic temporal progression. However, in tasks like video salient object detection, where both appearance and motion cues are critical, these basic image-to-video techniques fail to produce realistic optical flows that capture the independent motion properties of each object. In this study, we show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components. This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements. By augmenting individual images in this way, we create large-scale image-flow pairs that significantly enhance model training. Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.
https://arxiv.org/abs/2411.13975
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild. Traditional frameworks, such as ParticleSfM~\cite{zhao2022particlesfm}, address this problem by sequentially computing the optical flow between adjacent frames to obtain point trajectories. They then remove dynamic trajectories through motion segmentation and perform global bundle adjustment. However, estimating optical flow between adjacent frames and chaining the matches can introduce cumulative errors. Additionally, motion segmentation combined with single-view depth estimation often faces challenges related to scale ambiguity. To tackle these challenges, we propose a dynamic-aware tracking-any-point (DATAP) method that leverages consistent video depth and point tracking. Specifically, DATAP addresses these issues by estimating dense point tracks across the video sequence and predicting the visibility and dynamics of each point. Incorporating a consistent video depth prior further enhances the performance of motion segmentation. With DATAP integrated, all camera poses can be estimated and optimized simultaneously by performing global bundle adjustment on point tracks classified as static and visible, rather than relying on incremental camera registration. Extensive experiments on dynamic sequences, e.g., Sintel and TUM RGBD dynamic sequences, and on in-the-wild videos, e.g., DAVIS, demonstrate that the proposed method achieves state-of-the-art camera pose estimation even in complex and challenging dynamic scenes.
https://arxiv.org/abs/2411.13291
The dynamic imbalance between foreground and background is a major challenge in video object counting, usually caused by the sparsity of foreground objects. It often leads to severe under- and over-prediction and has received little attention in existing work. To tackle this issue in video object counting, we propose a density-embedded Efficient Masked Autoencoder Counting (E-MAC) framework in this paper. To effectively capture the dynamic variations across frames, we utilize an optical flow-based temporal collaborative fusion that aligns features to derive multi-frame density residuals. The counting accuracy of the current frame is boosted by harnessing information from adjacent frames. More importantly, to strengthen the intra-frame representation of dynamic foreground objects, we first take the density map as an auxiliary modality and perform $\mathtt{D}$ensity-$\mathtt{E}$mbedded $\mathtt{M}$asked m$\mathtt{O}$deling ($\mathtt{DEMO}$) for multimodal self-representation learning to regress the density map. However, while $\mathtt{DEMO}$ provides effective cross-modal regression guidance, it also introduces redundant background information and makes it hard to focus on foreground regions. To handle this dilemma, we further propose an efficient spatial adaptive masking scheme derived from density maps to boost efficiency. In addition, considering that most existing datasets are limited to human-centric scenarios, we introduce $\textit{DroneBird}$, a large-scale video bird counting dataset captured in natural scenarios for migratory bird protection. Extensive experiments on three crowd datasets and our $\textit{DroneBird}$ validate our superiority over existing counterparts.
https://arxiv.org/abs/2411.13056
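The density-derived spatial adaptive masking can be pictured roughly as follows (a simplified sketch; the patch size, keep ratio, and top-k selection are assumptions rather than E-MAC's settings):

```python
import torch
import torch.nn.functional as F

def density_adaptive_mask(density, patch=16, keep_ratio=0.25):
    """Select which patches a masked autoencoder should keep, biased toward
    high-density (foreground) regions.

    density: (B, 1, H, W) predicted or ground-truth density map.
    Returns a boolean keep-mask of shape (B, num_patches).
    """
    pooled = F.avg_pool2d(density, patch)                   # per-patch density mass
    scores = pooled.flatten(1)                              # (B, num_patches)
    k = max(1, int(keep_ratio * scores.shape[1]))
    top = scores.topk(k, dim=1).indices
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep.scatter_(1, top, torch.ones_like(top, dtype=torch.bool))
    return keep

density_map = torch.rand(2, 1, 256, 256)
keep_mask = density_adaptive_mask(density_map)              # (2, 256) patches, 64 kept each
```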
We present AnimateAnything, a unified controllable video generation approach that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions; it explicitly converts all control information into frame-by-frame optical flows. We then incorporate these optical flows as motion priors to guide final video generation. In addition, to reduce the flickering caused by large-scale motion, we propose a frequency-based stabilization module that enhances temporal coherence by enforcing consistency in the video's frequency domain. Experiments demonstrate that our method outperforms state-of-the-art approaches. For more details and videos, please refer to the webpage: this https URL.
https://arxiv.org/abs/2411.10836
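One simple reading of frequency-domain consistency is a penalty on high temporal frequencies, a common proxy for flicker (an interpretation sketched under assumed shapes; the paper's stabilization module is not specified in the abstract):

```python
import torch

def temporal_high_freq_energy(video, cutoff=0.25):
    """Energy of high temporal-frequency components of a video, a proxy for flicker.

    video: (B, T, C, H, W). The FFT is taken along the time axis; frequencies above
    `cutoff` (fraction of the temporal band) are treated as flicker and penalized.
    """
    spec = torch.fft.rfft(video, dim=1)                     # (B, T//2+1, C, H, W), complex
    n_freq = spec.shape[1]
    start = max(1, int(cutoff * n_freq))                    # skip the DC / low-frequency band
    return spec[:, start:].abs().pow(2).mean()

static = torch.rand(1, 1, 3, 32, 32).repeat(1, 16, 1, 1, 1)     # temporally constant clip
flicker = static.clone()
flicker[:, ::2] += 0.5                                          # alternating-frame brightness jump
assert temporal_high_freq_energy(flicker) > temporal_high_freq_energy(static)
```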
We consider text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on providing user-defined controls, such as binary masks or camera movement embeddings. We propose OnlyFlow, an approach that leverages the optical flow extracted from an input video to condition the motion of the generated video. Given a text prompt and an input video, OnlyFlow allows the user to generate videos that respect both the motion of the input video and the text prompt. This is implemented through an optical flow estimation model applied to the input video, whose output is then fed to a trainable optical flow encoder. The resulting feature maps are injected into the text-to-video backbone model. We perform quantitative, qualitative, and user preference studies to show that OnlyFlow compares favorably with state-of-the-art methods on a wide range of tasks, even though it was not specifically trained for them. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.
https://arxiv.org/abs/2411.10501
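The injection mechanism can be pictured schematically as follows (a stand-in, not OnlyFlow's architecture; the encoder widths, backbone channel count, and additive injection are assumptions):

```python
import torch
import torch.nn as nn

class FlowEncoder(nn.Module):
    """Encode a (B, T, 2, H, W) optical-flow sequence into multi-scale feature maps."""
    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        chans = (2,) + tuple(dims)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                          nn.SiLU())
            for i in range(len(dims))])

    def forward(self, flow):
        b, t, c, h, w = flow.shape
        x = flow.reshape(b * t, c, h, w)       # fold time into the batch dimension
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                           # one feature map per backbone resolution

def inject(backbone_hidden, flow_feat, proj):
    """Add projected flow features to a text-to-video backbone hidden state."""
    return backbone_hidden + proj(flow_feat)

enc = FlowEncoder()
flows = torch.randn(1, 8, 2, 64, 64)           # e.g., output of an off-the-shelf flow model
feats = enc(flows)                             # resolutions 32, 16, 8
hidden = torch.randn(8, 320, 32, 32)           # hypothetical per-frame backbone activations
proj = nn.Conv2d(64, 320, 1)
hidden = inject(hidden, feats[0], proj)
```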