This paper presents GeoFlow-SLAM, a robust and effective tightly-coupled RGBD-inertial SLAM for legged robots operating in highly dynamic environments. By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges: feature matching failures and pose initialization failures during fast locomotion, and visual feature scarcity in texture-less environments. Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU errors in legged robots, integrating IMU/legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is introduced for the first time to improve robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) performance on our collected legged-robot datasets and on open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at this https URL
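A minimal sketch of the kind of initialization cascade the abstract describes (IMU/legged-odometry prior, inter-frame PnP, GICP refinement), assuming hypothetical solver callables rather than the authors' implementation:

```python
import numpy as np

def initialize_pose(prior_pose, pnp_solve, gicp_refine, min_inliers=30):
    """Fallback cascade for pose initialization (illustrative, not GeoFlow-SLAM's code).

    prior_pose : 4x4 pose propagated from IMU / legged odometry (numpy array)
    pnp_solve  : callable(prior_pose) -> (pose, num_inliers)   # hypothetical PnP helper
    gicp_refine: callable(init_pose)  -> (pose, fitness)       # hypothetical GICP helper
    """
    pose, inliers = pnp_solve(prior_pose)
    if inliers < min_inliers:             # PnP tends to break down during fast locomotion
        pose = prior_pose                 # fall back to the kinematic/inertial prior
    refined, fitness = gicp_refine(pose)  # geometric refinement against local depth/map
    return refined if fitness > 0.5 else pose

# toy usage with stand-in solvers
print(initialize_pose(np.eye(4),
                      pnp_solve=lambda p: (p, 10),      # pretend PnP found too few inliers
                      gicp_refine=lambda p: (p, 0.9)))
```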
https://arxiv.org/abs/2503.14247
Video anomaly detection plays a significant role in intelligent surveillance systems. To enhance models' anomaly recognition ability, previous works have typically involved RGB, optical flow, and text features. Recently, dynamic vision sensors (DVS) have emerged as a promising technology; they capture visual information as discrete events with a very high dynamic range and temporal resolution, reducing data redundancy and enhancing the capture of moving objects compared to conventional cameras. To introduce this rich dynamic information into the surveillance field, we created the first DVS video anomaly detection benchmark, namely UCF-Crime-DVS. To fully utilize this new data modality, a multi-scale spiking fusion network (MSF) is designed based on spiking neural networks (SNNs). This work explores the potential application of dynamic information from event data in video anomaly detection. Our experiments demonstrate the effectiveness of our framework on UCF-Crime-DVS and its superior performance compared to other models, establishing a new baseline for SNN-based weakly supervised video anomaly detection.
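For readers unfamiliar with the building blocks, a generic leaky integrate-and-fire layer in plain PyTorch illustrates the spiking dynamics that SNN models such as MSF build on; this is a textbook sketch, not the paper's fusion block:

```python
import torch

class LIFNeuron(torch.nn.Module):
    """Minimal leaky integrate-and-fire layer (generic sketch, not the MSF block)."""

    def __init__(self, beta=0.9, threshold=1.0):
        super().__init__()
        self.beta, self.threshold = beta, threshold

    def forward(self, x_seq):                  # x_seq: (T, B, C) input currents
        mem = torch.zeros_like(x_seq[0])
        spikes = []
        for x_t in x_seq:                      # integrate over the time dimension
            mem = self.beta * mem + x_t        # leaky integration
            spk = (mem >= self.threshold).float()
            mem = mem - spk * self.threshold   # soft reset after a spike
            spikes.append(spk)
        return torch.stack(spikes)             # binary spike train, same shape as input
```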
https://arxiv.org/abs/2503.12905
This paper studies optical flow estimation, a critical task in motion analysis with applications in autonomous navigation, action recognition, and film production. Traditional optical flow methods require consecutive frames, which are often unavailable due to limitations in data acquisition or real-world scene disruptions. Thus, single-frame optical flow estimation is emerging in the literature. However, existing single-frame approaches suffer from two major limitations: (1) they rely on labeled training data, making them task-specific, and (2) they produce deterministic predictions, failing to capture motion uncertainty. To overcome these challenges, we propose ProbDiffFlow, a training-free framework that estimates optical flow distributions from a single image. Instead of directly predicting motion, ProbDiffFlow follows an estimation-by-synthesis paradigm: it first generates diverse plausible future frames using a diffusion-based model, then estimates motion from these synthesized samples using a pre-trained optical flow model, and finally aggregates the results into a probabilistic flow distribution. This design eliminates the need for task-specific training while capturing multiple plausible motions. Experiments on both synthetic and real-world datasets demonstrate that ProbDiffFlow achieves superior accuracy, diversity, and efficiency, outperforming existing single-image and two-frame baselines.
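The estimation-by-synthesis loop can be pictured with a short sketch, assuming a diffusion sampler and a pre-trained two-frame flow model are available as callables (the names here are placeholders, not the authors' API):

```python
import numpy as np

def prob_flow_from_single_image(image, synthesize_next, estimate_flow, n_samples=16):
    """Estimate a flow distribution from one image (illustrative pipeline only).

    synthesize_next: callable(image) -> a plausible future frame (e.g. a diffusion sampler)
    estimate_flow  : callable(frame0, frame1) -> (H, W, 2) flow from a pre-trained model
    """
    flows = np.stack([estimate_flow(image, synthesize_next(image))
                      for _ in range(n_samples)])        # (N, H, W, 2) flow samples
    return flows.mean(axis=0), flows.var(axis=0)         # per-pixel mean flow and spread
```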
https://arxiv.org/abs/2503.12348
Visual 3D motion estimation aims to infer the motion of 2D pixels in 3D space based on visual cues. The key challenge arises from depth-variation-induced spatio-temporal motion inconsistencies, which disrupt the assumptions of local spatial or temporal motion smoothness made by previous motion estimation frameworks. In contrast, event cameras offer new possibilities for 3D motion estimation through continuous, adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models spatio-temporal trajectories via event-guided non-uniform parametric curves, effectively characterizing locally heterogeneous spatio-temporal motion. Specifically, we first introduce the Event Kymograph - an event projection method that leverages a continuous temporal projection kernel and decouples spatial observations to encode fine-grained temporal evolution explicitly. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance, coupled with a non-uniform rational curve parameterization framework to adaptively model heterogeneous trajectories. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, yielding optical flow and depth motion fields. To facilitate evaluation, we introduce CarlaEvent3D, a multi-dynamic synthetic dataset for comprehensive validation. Extensive experiments on both this dataset and a real-world benchmark demonstrate the effectiveness of the proposed method.
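As a rough illustration of a kymograph-style projection, the sketch below accumulates events onto an (x, t) plane with a Gaussian temporal kernel; the kernel choice and binning are assumptions of this sketch, and the paper's Event Kymograph and curve parameterization go further:

```python
import numpy as np

def event_kymograph(xs, ts, width, t_bins=64, sigma=0.01):
    """Project events onto an (x, t) plane, decoupled from y (illustrative only).

    xs: (N,) integer pixel x-coordinates; ts: (N,) event timestamps in seconds.
    """
    t_centers = np.linspace(ts.min(), ts.max(), t_bins)
    kymo = np.zeros((t_bins, width))
    # each event contributes to nearby temporal bins with a soft Gaussian weight
    w = np.exp(-0.5 * ((ts[None, :] - t_centers[:, None]) / sigma) ** 2)
    for k in range(t_bins):
        np.add.at(kymo[k], xs, w[k])
    return kymo
```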
https://arxiv.org/abs/2503.11371
Video frame prediction remains a fundamental challenge in computer vision with direct implications for autonomous systems, video compression, and media synthesis. We present FG-DFPN, a novel architecture that harnesses the synergy between optical flow estimation and deformable convolutions to model complex spatio-temporal dynamics. By guiding deformable sampling with motion cues, our approach addresses the limitations of fixed-kernel networks when handling diverse motion patterns. The multi-scale design enables FG-DFPN to simultaneously capture global scene transformations and local object movements with remarkable precision. Our experiments demonstrate that FG-DFPN achieves state-of-the-art performance on eight diverse MPEG test sequences, outperforming existing methods by 1dB PSNR while maintaining competitive inference speeds. The integration of motion cues with adaptive geometric transformations makes FG-DFPN a promising solution for next-generation video processing systems that require high-fidelity temporal predictions. The model and instructions to reproduce our results will be released at: this https URL Group/frame-prediction
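The core idea of steering deformable sampling with motion cues can be sketched with torchvision's deformable convolution; the (dy, dx) channel ordering and the single shared offset per kernel tap are assumptions of this illustration, not details taken from FG-DFPN:

```python
import torch
from torchvision.ops import deform_conv2d

def flow_guided_deform_conv(feat, flow, weight, bias=None):
    """Use a flow field to steer the sampling locations of a 3x3 deformable convolution.

    feat  : (B, C_in, H, W) features of the reference frame
    flow  : (B, 2, H, W) motion cue, assumed ordered (dy, dx) to match deform_conv2d
    weight: (C_out, C_in, 3, 3) convolution kernel
    """
    offset = flow.repeat(1, 9, 1, 1)   # same displacement for all 9 taps -> (B, 18, H, W)
    return deform_conv2d(feat, offset, weight, bias=bias, padding=1)
```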
https://arxiv.org/abs/2503.11343
Low-light and underwater videos suffer from poor visibility, low contrast, and high noise, necessitating enhancements in visual quality. However, existing approaches typically rely on paired ground truth, which limits their practicality and often fails to maintain temporal consistency. To overcome these obstacles, this paper introduces a novel zero-shot learning approach named Zero-TIG, leveraging the Retinex theory and optical flow techniques. The proposed network consists of an enhancement module and a temporal feedback module. The enhancement module comprises three subnetworks: low-light image denoising, illumination estimation, and reflection denoising. The temporal feedback module ensures temporal consistency by incorporating histogram equalization, optical flow computation, and image warping to align the enhanced previous frame with the current frame, thereby maintaining continuity. Additionally, we address color distortion in underwater data by adaptively balancing RGB channels. The experimental results demonstrate that our method achieves low-light video enhancement without the need for paired training data, making it a promising and applicable method for real-world scenario enhancement.
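The warping step of such a temporal feedback loop is standard and can be sketched with grid_sample; the (dx, dy) flow convention is an assumption, and the histogram equalization and denoising stages are omitted:

```python
import torch
import torch.nn.functional as F

def warp_with_flow(prev_frame, flow):
    """Backward-warp the previously enhanced frame toward the current one.

    prev_frame: (B, C, H, W); flow: (B, 2, H, W) in pixels, assumed (dx, dy) order.
    """
    _, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow.device)       # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                                 # sampling positions
    # normalise to [-1, 1]; grid_sample expects (B, H, W, 2) ordered (x, y)
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(prev_frame, torch.stack((gx, gy), dim=-1), align_corners=True)
```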
https://arxiv.org/abs/2503.11175
Learning accurate scene reconstruction without pose priors in neural radiance fields is challenging due to inherent geometric ambiguity. Recent developments either rely on correspondence priors for regularization or use off-the-shelf flow estimators to derive analytical poses. However, the potential for jointly learning scene geometry, camera poses, and dense flow within a unified neural representation remains largely unexplored. In this paper, we present Flow-NeRF, a unified framework that simultaneously optimizes scene geometry, camera poses, and dense optical flow, all on the fly. To enable the learning of dense flow within the neural radiance field, we design and build a bijective mapping for flow estimation, conditioned on pose. To make scene reconstruction benefit from the flow estimation, we develop an effective feature enhancement mechanism that passes canonical-space features to world-space representations, significantly enhancing scene geometry. We validate our model across four important tasks, i.e., novel view synthesis, depth estimation, camera pose prediction, and dense optical flow estimation, using several datasets. Our approach surpasses previous methods in almost all metrics for novel-view synthesis and depth estimation and yields both qualitatively sound and quantitatively accurate novel-view flow. Our project page is this https URL.
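The geometric link that makes joint optimization possible is that depth plus a relative pose already induce a dense flow field; a small sketch of that classical relation (not Flow-NeRF's bijective mapping) follows:

```python
import numpy as np

def flow_from_depth_and_pose(depth, K, R, t):
    """Dense flow induced by relative camera motion (R, t) and per-pixel depth.

    depth: (H, W) depth map in camera 1; K: 3x3 intrinsics; returns a (2, H, W) flow.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                      # back-projected rays
    pts = rays * depth.reshape(1, -1)                                  # 3D points in cam 1
    proj = K @ (R @ pts + t.reshape(3, 1))                             # project into cam 2
    uv = proj[:2] / proj[2:]                                           # perspective divide
    return (uv - pix[:2]).reshape(2, h, w)                             # (dx, dy) field
```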
https://arxiv.org/abs/2503.10464
Our study focuses on isolating swallowing dynamics from interfering patient motion in videofluoroscopy, an X-ray technique that records patients swallowing a radiopaque bolus. These recordings capture multiple motion sources, including head movement, anatomical displacements, and bolus transit. To enable precise analysis of swallowing physiology, we aim to eliminate distracting motion, particularly head movement, while preserving essential swallowing-related dynamics. Optical flow methods fail due to artifacts like flickering and instability, making them unreliable for distinguishing different motion groups. We evaluated markerless tracking approaches (CoTracker, PIPs++, TAP-Net) and quantified tracking accuracy in key medical regions of interest. Our findings show that even sparse tracking points generate morphing displacement fields that outperform leading registration methods such as ANTs, LDDMM, and VoxelMorph. To compare all approaches, we assessed performance using MSE and SSIM metrics post-registration. We introduce a novel motion correction pipeline that effectively removes disruptive motion while preserving swallowing dynamics and surpassing competitive registration techniques. Code will be available after review.
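One plausible reading of "morphing displacement fields" from sparse tracks is thin-plate-spline interpolation of the per-point displacements; the sketch below uses SciPy for that step and is not the authors' exact pipeline:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def dense_field_from_tracks(pts_ref, pts_cur, shape):
    """Interpolate sparse point-track displacements into a dense (H, W, 2) field.

    pts_ref, pts_cur: (P, 2) tracked point locations (x, y) in the two frames.
    """
    disp = pts_cur - pts_ref                           # per-point displacement vectors
    interp = RBFInterpolator(pts_ref, disp, kernel="thin_plate_spline", smoothing=1.0)
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)  # query every pixel
    return interp(grid).reshape(h, w, 2)
```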
https://arxiv.org/abs/2503.10260
Spiking Neural Networks (SNNs) have emerged as a promising tool for event-based optical flow estimation tasks due to their ability to leverage spatio-temporal information and low-power capabilities. However, the performance of SNN models is often constrained, limiting their application in real-world scenarios. In this work, we address this gap by proposing a novel neural network architecture, ST-FlowNet, specifically tailored for optical flow estimation from event-based data. The ST-FlowNet architecture integrates ConvGRU modules to facilitate cross-modal feature augmentation and temporal alignment of the predicted optical flow, improving the network's ability to capture complex motion dynamics. Additionally, to overcome the challenges associated with training SNNs, we introduce a novel approach to derive SNN models from pre-trained artificial neural networks (ANNs) through ANN-to-SNN conversion or our proposed BISNN method. Notably, the BISNN method alleviates the complexities involved in biological parameter selection, further enhancing the robustness of SNNs in optical flow estimation tasks. Extensive evaluations on three benchmark event-based datasets demonstrate that the SNN-based ST-FlowNet model outperforms state-of-the-art methods, delivering superior performance in accurate optical flow estimation across a diverse range of dynamic visual scenes. Furthermore, the inherent energy efficiency of SNN models is highlighted, establishing a compelling advantage for their practical deployment. Overall, our work presents a novel framework for optical flow estimation using SNNs and event-based data, contributing to the advancement of neuromorphic vision applications.
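A common ANN-to-SNN conversion step is to keep the trained weights and swap each ReLU for a spiking unit; the sketch below shows only that swap and omits the threshold calibration and the biological-parameter handling that BISNN addresses:

```python
import torch.nn as nn

def convert_relu_to_spiking(model, spiking_factory):
    """Recursively replace every ReLU with a spiking activation (illustrative recipe).

    spiking_factory: callable() -> nn.Module implementing, e.g., an integrate-and-fire unit.
    """
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, spiking_factory())   # weights of other layers are kept
        else:
            convert_relu_to_spiking(child, spiking_factory)
    return model
```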
https://arxiv.org/abs/2503.10195
Automatic Video Object Segmentation (AVOS) refers to the task of autonomously segmenting target objects in video sequences without relying on human-provided annotations in the first frames. In AVOS, the use of motion information is crucial, with optical flow being a commonly employed method for capturing motion cues. However, the computation of optical flow is resource-intensive, making it unsuitable for real-time applications, especially on edge devices with limited computational resources. In this study, we propose using frame differences as an alternative to optical flow for motion cue extraction. We developed an extended U-Net-like AVOS model that takes a frame on which segmentation is performed and a frame difference as inputs, and outputs an estimated segmentation map. Our experimental results demonstrate that the proposed model achieves performance comparable to the model with optical flow as an input, particularly when applied to videos captured by stationary cameras. Our results suggest the usefulness of employing frame differences as motion cues in cases with limited computational resources.
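The motion cue itself is inexpensive to construct: a frame difference concatenated with the current frame replaces the optical-flow input (the exact channel layout here is an assumption):

```python
import torch

def build_motion_cue_input(frame_t, frame_prev):
    """Concatenate the current frame with a frame difference as a cheap motion cue.

    frame_t, frame_prev: (B, 3, H, W) tensors in [0, 1]; returns a (B, 6, H, W) input.
    """
    diff = frame_t - frame_prev              # no optical-flow network required
    return torch.cat([frame_t, diff], dim=1)
```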
https://arxiv.org/abs/2503.09132
Burst image processing (BIP), which captures and integrates multiple frames into a single high-quality image, is widely used in consumer cameras. As a typical BIP task, Burst Image Super-Resolution (BISR) has achieved notable progress through deep learning in recent years. Existing BISR methods typically involve three key stages: alignment, upsampling, and fusion, often in varying orders and implementations. Among these stages, alignment is particularly critical for ensuring accurate feature matching and further reconstruction. However, existing methods often rely on techniques such as deformable convolutions and optical flow to realize alignment, which either focus only on local transformations or lack theoretical grounding, thereby limiting their performance. To alleviate these issues, we propose a novel framework for BISR, featuring an equivariant convolution-based alignment, ensuring consistent transformations between the image and feature domains. This enables the alignment transformation to be learned via explicit supervision in the image domain and easily applied in the feature domain in a theoretically sound way, effectively improving alignment accuracy. Additionally, we design an effective reconstruction module with advanced deep architectures for upsampling and fusion to obtain the final BISR result. Extensive experiments on BISR benchmarks show the superior performance of our approach in both quantitative metrics and visual quality.
https://arxiv.org/abs/2503.08300
Optical flow estimation based on deep learning, particularly the recently proposed top-performing methods that incorporate Transformers, has demonstrated impressive performance thanks to the Transformer's powerful global modeling capabilities. However, the quadratic computational complexity of the attention mechanism in Transformers results in time-consuming training and inference. To alleviate these issues, we propose a novel MambaFlow framework that leverages the high accuracy and efficiency of the Mamba architecture to capture features with local correlation while preserving global information, achieving remarkable performance. To the best of our knowledge, the proposed method is the first Mamba-centric architecture for end-to-end optical flow estimation. It comprises two primary contributed components, both Mamba-centric: a feature enhancement Mamba (FEM) module designed to optimize feature representation quality, and a flow propagation Mamba (FPM) module engineered to address occlusion issues by facilitating effective flow information dissemination. Extensive experiments demonstrate that our approach achieves state-of-the-art results even in the presence of occluded regions. On the Sintel benchmark, MambaFlow achieves an EPE-all of 1.60, surpassing GMFlow's leading 1.74. Additionally, MambaFlow significantly improves inference speed with a runtime of 0.113 seconds, making it 18% faster than GMFlow. The source code will be made publicly available upon acceptance of the paper.
https://arxiv.org/abs/2503.07046
High-dynamic scene optical flow is a challenging task: large displacements in frame imaging cause spatial blur and temporally discontinuous motion, thus deteriorating the spatiotemporal features of optical flow. Typically, existing methods introduce an event camera and directly fuse the spatiotemporal features of the two modalities. However, this direct fusion is ineffective, since the heterogeneous data representations of the frame and event modalities leave a large gap between them. To address this issue, we explore a common latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, comprising visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we observe that frames and events share similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design a common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we discover that frame-based motion possesses spatially dense but temporally discontinuous correlation, while event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide the fusion of complementary motion knowledge between the two modalities. Moreover, the common spatiotemporal fusion not only relieves the cross-modal feature discrepancy but also makes the fusion process interpretable for dense and continuous optical flow. Extensive experiments have been performed to verify the superiority of the proposed method.
https://arxiv.org/abs/2503.06992
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video, even in the presence of occlusions. Traditional methods use optical flow models to directly estimate long-range motion, but they often suffer from appearance drifting without considering temporal consistency. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one, which is slow and less effective for long-range tracking. To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing. The SPOT framework features three core components: a customized memory reading module for feature enhancement, a sensory memory for short-term motion dynamics modeling, and a visibility-guided splatting module for accurate information propagation. This combination enables SPOT to perform dense point tracking with state-of-the-art accuracy on the CVO benchmark, as well as comparable or superior performance to offline models on sparse tracking benchmarks such as TAP-Vid and RoboTAP. Notably, SPOT with 10× smaller parameter numbers operates at least 2× faster than previous state-of-the-art models while maintaining the best performance on CVO. We will release the models and codes at: this https URL.
https://arxiv.org/abs/2503.06471
Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods face challenges: approaches that propagate unmasked-region pixels via optical flow and receptive-field priors struggle to generate fully masked objects, while those that extend image-inpainting models temporally struggle to balance the competing objectives of background-context preservation and foreground generation within a single model. To address these limitations, we propose VideoPainter, a novel dual-stream paradigm that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues into any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target-region ID resampling technique that enables any-length video inpainting, greatly enhancing practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision-understanding models, contributing VPData and VPBench, the largest video inpainting dataset and benchmark to date with over 390K diverse clips, to facilitate segmentation-based inpainting training and assessment. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video-editing pair-data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing across eight key metrics, including video quality, mask-region preservation, and textual coherence.
https://arxiv.org/abs/2503.05639
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.
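As background, a dense matching-cost volume for rectified stereo can be built by correlating every pixel in a row of the left features with every pixel in the same row of the right features; this generic sketch is only an illustration, not the paper's all-to-all-pairs correlation:

```python
import torch

def row_all_pairs_correlation(feat_left, feat_right):
    """Per-row dense matching cost between left and right feature maps.

    feat_left, feat_right: (B, C, H, W) -> cost volume of shape (B, H, W, W).
    """
    b, c, h, w = feat_left.shape
    cost = torch.einsum("bchw,bchv->bhwv", feat_left, feat_right)  # dot product per pair
    return cost / c ** 0.5                                          # scale for stability
```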
https://arxiv.org/abs/2503.05549
We present a novel approach for super-resolution that utilizes implicit neural representation (INR) to effectively reconstruct and enhance low-resolution videos and images. By leveraging the capacity of neural networks to implicitly encode spatial and temporal features, our method facilitates high-resolution reconstruction using only low-resolution inputs and a 3D high-resolution grid. This results in an efficient solution for both image and video super-resolution. Our proposed method, SR-INR, maintains consistent details across frames and images, achieving impressive temporal stability without relying on the computationally intensive optical flow or motion estimation typically used in other video super-resolution techniques. The simplicity of our approach contrasts with the complexity of many existing methods, making it both effective and efficient. Experimental evaluations show that SR-INR delivers results on par with or superior to state-of-the-art super-resolution methods, while maintaining a more straightforward structure and reduced computational demands. These findings highlight the potential of implicit neural representations as a powerful tool for reconstructing high-quality, temporally consistent video and image signals from low-resolution data.
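A toy implicit representation conveys the idea: a coordinate MLP with Fourier features maps (x, y, t) to RGB and can then be queried on a 3D high-resolution grid; this generic sketch is not the SR-INR architecture:

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Map normalised (x, y, t) coordinates to RGB through a small Fourier-feature MLP."""

    def __init__(self, n_freqs=10, hidden=256):
        super().__init__()
        self.n_freqs = n_freqs
        self.net = nn.Sequential(
            nn.Linear(3 * 2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),           # RGB in [0, 1]
        )

    def forward(self, coords):                            # coords: (N, 3) in [0, 1]
        freqs = 2 ** torch.arange(self.n_freqs, device=coords.device) * torch.pi
        ang = coords.unsqueeze(-1) * freqs                # (N, 3, n_freqs)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)
        return self.net(enc)
```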
https://arxiv.org/abs/2503.04665
Optical flow is a fundamental technique for motion estimation, widely applied in video stabilization, interpolation, and object tracking. Recent advancements in artificial intelligence (AI) have enabled deep learning models to leverage optical flow as an important feature for motion analysis. However, traditional optical flow methods rely on restrictive assumptions, such as brightness constancy and slow motion constraints, limiting their effectiveness in complex scenes. Deep learning-based approaches require extensive training on large domain-specific datasets, making them computationally demanding. Furthermore, optical flow is typically visualized in the HSV color space, which introduces nonlinear distortions when converted to RGB and is highly sensitive to noise, degrading motion representation accuracy. These limitations inherently constrain the performance of downstream models, potentially hindering object tracking and motion analysis tasks. To address these challenges, we propose Reynolds flow, a novel training-free flow estimation method inspired by the Reynolds transport theorem, offering a principled approach to modeling complex motion dynamics. Beyond the conventional HSV-based visualization, denoted ReynoldsFlow, we introduce an alternative representation, ReynoldsFlow+, designed to improve flow visualization. We evaluate ReynoldsFlow and ReynoldsFlow+ across three video-based benchmarks: tiny object detection on UAVDB, infrared object detection on Anti-UAV, and pose estimation on GolfDB. Experimental results demonstrate that networks trained with ReynoldsFlow+ achieve state-of-the-art (SOTA) performance, exhibiting improved robustness and efficiency across all tasks.
https://arxiv.org/abs/2503.04500
We present a generic video super-resolution algorithm in this paper, based on the Diffusion Posterior Sampling framework with an unconditional video generation model in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Due to limited computational resources and training data, our experiments provide empirical evidence of the algorithm's strong super-resolution capabilities using synthetic data.
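The posterior-sampling idea can be sketched as a data-consistency correction applied during reverse diffusion; step-size handling, the latent space, and the space-time transformer are omitted, and the callables below are placeholders rather than the paper's interfaces:

```python
import torch

def dps_correction(x_t, y, predict_x0, degrade, zeta=1.0):
    """Nudge the current noisy sample toward the low-resolution observation y.

    predict_x0: callable(x_t) -> clean estimate from the (frozen) diffusion model
    degrade   : differentiable callable mapping a high-res estimate to observation space
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = predict_x0(x_t)                          # keeps the graph back to x_t
    residual = ((y - degrade(x0_hat)) ** 2).sum()     # data-fidelity term
    grad, = torch.autograd.grad(residual, x_t)
    return (x_t - zeta * grad).detach(), x0_hat.detach()
```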
https://arxiv.org/abs/2503.03355
Event cameras deliver visual information characterized by a high dynamic range and high temporal resolution, offering significant advantages in estimating optical flow for complex lighting conditions and fast-moving objects. Current advanced optical flow methods for event cameras largely adopt established image-based frameworks. However, the spatial sparsity of event data limits their performance. In this paper, we present BAT, an innovative framework that estimates event-based optical flow using bidirectional adaptive temporal correlation. BAT includes three novel designs: 1) a bidirectional temporal correlation that transforms bidirectional temporally dense motion cues into spatially dense ones, enabling accurate and spatially dense optical flow estimation; 2) an adaptive temporal sampling strategy for maintaining temporal consistency in correlation; 3) spatially adaptive temporal motion aggregation to efficiently and adaptively aggregate consistent target motion features into adjacent motion features while suppressing inconsistent ones. Our results rank 1st on the DSEC-Flow benchmark, outperforming existing state-of-the-art methods by a large margin while also exhibiting sharp edges and high-quality details. Notably, our BAT can accurately predict future optical flow using only past events, significantly outperforming E-RAFT's warm-start approach. Code: this https URL.
https://arxiv.org/abs/2503.03256