Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial to safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN employing a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and our RoboSubtask benchmark (boundary-sensitive and sequence metrics), while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1@50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1@50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1@50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success of approximately 91.25%). These results demonstrate a practical path from sub-task-level video understanding to deployed robotic manipulation in real-world settings.
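The Fibonacci dilation schedule described above can be sketched as follows; the layer count and kernel size here are illustrative assumptions rather than values from the paper. Compared with the usual doubling schedule, Fibonacci dilations grow more slowly, so the stack spends more layers on short temporal horizons:

```python
# Sketch of a Fibonacci dilation schedule for a stacked dilated temporal
# convolution (MS-TCN-style). Layer count and kernel size are illustrative.

def fibonacci_dilations(num_layers):
    """Return the first `num_layers` Fibonacci numbers as dilation rates."""
    dilations, a, b = [], 1, 1
    for _ in range(num_layers):
        dilations.append(a)
        a, b = b, a + b
    return dilations

def receptive_field(dilations, kernel_size=3):
    """Receptive field (in frames) of stacked dilated 1-D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Fibonacci: [1, 1, 2, 3, 5, 8] -> receptive field 41 frames;
# doubling:  [1, 2, 4, 8]       -> receptive field 31 frames with fewer,
# coarser steps, i.e. fewer layers devoted to short-horizon transitions.
```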
https://arxiv.org/abs/2602.10015
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
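The training-free temporal-statistics idea can be illustrated with a minimal sketch: given per-frame features from any pretrained semantic model, compute higher-order moments along the time axis and concatenate them into a motion descriptor. The specific moment set (mean, standard deviation, skewness, kurtosis) is an illustrative assumption:

```python
import numpy as np

def temporal_moments(features, eps=1e-8):
    """Training-free motion descriptor: per-dimension temporal statistics
    over a (T, D) sequence of frame features. Returns a (4*D,) vector of
    mean, std, skewness, and kurtosis along the time axis."""
    mu = features.mean(axis=0)
    centered = features - mu
    sigma = centered.std(axis=0)
    z = centered / (sigma + eps)        # standardized residuals
    skew = (z ** 3).mean(axis=0)        # third moment: asymmetry over time
    kurt = (z ** 4).mean(axis=0)        # fourth moment: burstiness over time
    return np.concatenate([mu, sigma, skew, kurt])
```

A static clip (identical features in every frame) yields zero variance, skewness, and kurtosis, so appearance alone cannot inflate the motion-sensitive part of the descriptor.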
https://arxiv.org/abs/2602.09146
We introduce FLAG-4D, a novel framework for generating novel views of dynamic scenes by reconstructing how 3D Gaussian primitives evolve through space and time. Existing methods typically rely on a single Multilayer Perceptron (MLP) to model temporal deformations, and they often struggle to capture complex point motions and fine-grained dynamic details consistently over time, especially from sparse input views. Our approach, FLAG-4D, overcomes this by employing a dual-deformation network that dynamically warps a canonical set of 3D Gaussians over time into new positions and anisotropic shapes. This dual-deformation network consists of an Instantaneous Deformation Network (IDN) for modeling fine-grained, local deformations and a Global Motion Network (GMN) for capturing long-range dynamics, refined through mutual learning. To ensure these deformations are both accurate and temporally smooth, FLAG-4D incorporates dense motion features from a pretrained optical flow backbone. We fuse these motion cues from adjacent timeframes and use a deformation-guided attention mechanism to align this flow information with the current state of each evolving 3D Gaussian. Extensive experiments demonstrate that FLAG-4D achieves higher-fidelity and more temporally coherent reconstructions with finer detail preservation than state-of-the-art methods.
https://arxiv.org/abs/2602.08558
Deep learning has the potential to improve colonoscopy by enabling 3D reconstruction of the colon, providing a comprehensive view of mucosal surfaces and lesions, and facilitating the identification of unexplored areas. However, the development of robust methods is limited by the scarcity of large-scale ground truth data. We propose RealSynCol, a highly realistic synthetic dataset designed to replicate the endoscopic environment. Colon geometries extracted from 10 CT scans were imported into a virtual environment that closely mimics intraoperative conditions and rendered with realistic vascular textures. The resulting dataset comprises 28,130 frames, paired with ground truth depth maps, optical flow, 3D meshes, and camera trajectories. A benchmark study was conducted to evaluate the available synthetic colon datasets for the tasks of depth and pose estimation. Results demonstrate that the high realism and variability of RealSynCol significantly enhance generalization performance on clinical images, proving it to be a powerful tool for developing deep learning algorithms to support endoscopic diagnosis.
https://arxiv.org/abs/2602.08397
Mainstream Visual-inertial odometry (VIO) systems rely on point features for motion estimation and localization. However, their performance degrades in challenging scenarios. Moreover, the localization accuracy of multi-state constraint Kalman filter (MSCKF)-based VIO systems suffers from linearization errors associated with feature 3D coordinates and delayed measurement updates. To improve the performance of VIO in challenging scenes, we first propose a pose-only geometric representation for line features. Building on this, we develop POPL-KF, a Kalman filter-based VIO system that employs a pose-only geometric representation for both point and line features. POPL-KF mitigates linearization errors by explicitly eliminating both point and line feature coordinates from the measurement equations, while enabling immediate update of visual measurements. We also design a unified base-frames selection algorithm for both point and line features to ensure optimal constraints on camera poses within the pose-only measurement model. To further improve line feature quality, a line feature filter based on image grid segmentation and bidirectional optical flow consistency is proposed. Our system is evaluated on public datasets and real-world experiments, demonstrating that POPL-KF outperforms the state-of-the-art (SOTA) filter-based methods (OpenVINS, PO-KF) and optimization-based methods (PL-VINS, EPLF-VINS), while maintaining real-time performance.
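The bidirectional optical-flow consistency check used in the line feature filter can be sketched roughly as below, applied here to individual tracked points (e.g. line endpoints): a point is kept only if the backward flow approximately undoes the forward flow. The threshold value is an illustrative assumption:

```python
import numpy as np

def forward_backward_consistent(pts, flow_fw, flow_bw, thresh=1.0):
    """Keep points whose forward flow is undone by the backward flow.
    pts: (N, 2) integer pixel coords (x, y); flow_fw/flow_bw: (H, W, 2)."""
    h, w = flow_fw.shape[:2]
    keep = np.zeros(len(pts), dtype=bool)
    for i, (x, y) in enumerate(pts):
        fx, fy = flow_fw[y, x]
        x2, y2 = int(round(x + fx)), int(round(y + fy))
        if not (0 <= x2 < w and 0 <= y2 < h):
            continue  # flowed out of the image: reject
        bx, by = flow_bw[y2, x2]
        err = np.hypot(fx + bx, fy + by)  # round-trip residual in pixels
        keep[i] = err < thresh
    return keep
```

Occlusions and unreliable matches typically fail this round-trip test, which is why it works as a quality filter before feeding features to the filter-based estimator.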
https://arxiv.org/abs/2602.06425
Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing up to 92.25% of parameters and 88.79% of computational FLOPs and a 2.1x speedup compared to existing methods. Project page: this https URL
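The "sequential state update" view of temporal modeling can be illustrated with a minimal linear state-space recurrence. The actual MambaVF module uses selective SSMs with bidirectional spatio-temporal scanning, so this is only the core recurrence, not the paper's architecture; matrices and sizes are placeholders:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence over a frame-feature sequence:
        h_t = A h_{t-1} + B x_t,   y_t = C h_t.
    Runs in time linear in sequence length T, with no pairwise frame
    matching or explicit motion estimation."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]   # fold the new frame into the running state
        ys.append(C @ h)       # read out the fused representation
    return np.stack(ys)
```

Because each frame is folded into a fixed-size state, memory cost does not grow with the temporal window, which is the efficiency argument the abstract makes against flow-guided alignment.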
https://arxiv.org/abs/2602.06017
Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
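The motion-aware submap construction can be approximated by a simple heuristic that accumulates per-frame optical-flow magnitude: near-static frames are pruned, and a new submap is cut once the accumulated motion exceeds a budget. The thresholds and the exact pruning rule are illustrative assumptions, not the paper's algorithm:

```python
def partition_submaps(flow_mags, motion_budget=20.0, static_eps=0.05):
    """Cut a frame stream into submaps by accumulated optical-flow magnitude.
    flow_mags: per-frame mean flow magnitude (pixels). Returns lists of
    frame indices, one list per submap."""
    submaps, current, accum = [], [], 0.0
    for i, m in enumerate(flow_mags):
        if m < static_eps:          # prune static redundancy
            continue
        current.append(i)
        accum += m
        if accum >= motion_budget:  # enough motion context: close submap
            submaps.append(current)
            current, accum = [], 0.0
    if current:
        submaps.append(current)
    return submaps
```

Partitioning by motion rather than by a fixed frame count is what avoids the zero-motion drift and broken-context submaps the abstract attributes to motion-agnostic schemes.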
https://arxiv.org/abs/2602.05508
Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose \method, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.
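The core warping operation, sampling target-frame features at locations displaced by the current flow/track estimate, can be sketched with plain bilinear interpolation (border handling by clipping is an assumption):

```python
import numpy as np

def warp_features(feat, flow):
    """Backward-warp a (H, W, C) feature map by a (H, W, 2) flow field
    using bilinear sampling: out[y, x] = feat[y + flow_y, x + flow_x]."""
    h, w = feat.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot
```

Iterating "warp, compare, refine" replaces the all-pairs cost volume: each refinement step only touches one sampled feature per pixel, so cost stays linear in resolution rather than quadratic.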
https://arxiv.org/abs/2602.04877
In this work, we propose DenVisCoM, a novel Mamba block, together with a hybrid architecture specifically tailored for accurate, real-time estimation of optical flow and disparity. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture combines DenVisCoM with a Transformer-based attention block that simultaneously addresses real-time inference, memory footprint, and accuracy for joint estimation of motion and 3D dense perception tasks. We extensively analyze the trade-off between accuracy and real-time processing on a large number of benchmark datasets. Our experimental results and related analysis suggest that the proposed model can accurately estimate optical flow and disparity in real time. All models and associated code are available at this https URL.
https://arxiv.org/abs/2602.01724
Video relighting offers immense creative potential and commercial value but is hindered by challenges, including the absence of an adequate evaluation metric, severe light flickering, and the degradation of fine-grained details during editing. To overcome these challenges, we introduce Hi-Light, a novel, training-free framework for high-fidelity, high-resolution, robust video relighting. Our approach introduces three technical innovations: lightness prior anchored guided relighting diffusion that stabilises intermediate relit video, a Hybrid Motion-Adaptive Lighting Smoothing Filter that leverages optical flow to ensure temporal stability without introducing motion blur, and a LAB-based Detail Fusion module that preserves high-frequency detail information from the original video. Furthermore, to address the critical gap in evaluation, we propose the Light Stability Score, the first quantitative metric designed to specifically measure lighting consistency. Extensive experiments demonstrate that Hi-Light significantly outperforms state-of-the-art methods in both qualitative and quantitative comparisons, producing stable, highly detailed relit videos.
https://arxiv.org/abs/2601.23167
Mechanical properties of red blood cells (RBCs) are promising biomarkers for hematologic and systemic disease, motivating microfluidic assays that probe deformability at throughputs of $10^3$--$10^6$ cells per experiment. However, existing pipelines rely on supervised segmentation or hand-crafted kymographs and rarely encode the laminar Stokes-flow physics that governs RBC shape evolution. We introduce FlowMorph, a physics-consistent self-supervised framework that learns a label-free scalar mechanics proxy $k$ for each tracked RBC from short brightfield microfluidic videos. FlowMorph models each cell by a low-dimensional parametric contour, advances boundary points through a differentiable "capsule-in-flow" model combining laminar advection and curvature-regularized elastic relaxation, and optimizes a loss coupling silhouette overlap, intra-cellular flow agreement, area conservation, wall constraints, and temporal smoothness, using only automatically derived silhouettes and optical flow. Across four public RBC microfluidic datasets, FlowMorph achieves a mean silhouette IoU of $0.905$ on physics-rich videos with provided velocity fields and markedly improves area conservation and reduces wall violations relative to purely data-driven baselines. On $\sim 1.5\times 10^5$ centered sequences, the scalar $k$ alone separates tank-treading from flipping dynamics with an AUC of $0.863$. Using only $200$ real-time deformability cytometry (RT-DC) events for calibration, a monotone map $E=g(k)$ predicts apparent Young's modulus with a mean absolute error of $0.118$ MPa on $600$ held-out cells and degrades gracefully under shifts in channel geometry, optics, and frame rate.
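The "capsule-in-flow" update can be caricatured in two steps, laminar advection plus curvature-regularized relaxation; the step sizes and the discrete-Laplacian form of the elastic term are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def advance_contour(pts, velocity, dt=0.1, elasticity=0.2):
    """One toy step of a capsule-in-flow contour update.
    pts: (N, 2) points of a closed contour; velocity: callable mapping
    (N, 2) points to (N, 2) flow velocities sampled at those points."""
    pts = pts + dt * velocity(pts)  # laminar advection by the flow field
    # Discrete Laplacian of the closed polygon: pulls each point toward
    # the midpoint of its neighbours (curvature-regularized relaxation).
    lap = np.roll(pts, 1, axis=0) + np.roll(pts, -1, axis=0) - 2 * pts
    return pts + elasticity * lap
```

Both steps are differentiable in the contour parameters, which is what lets the self-supervised losses (silhouette overlap, flow agreement, etc.) be optimized end-to-end. Note the Laplacian term preserves the contour centroid, so only the flow moves the cell as a whole.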
https://arxiv.org/abs/2601.17947
Motion representation plays an important role in video understanding and has many applications, including action recognition and robot and autonomous-vehicle guidance, among others. Recently, transformer networks, through their self-attention mechanism, have proved effective in many applications. In this study, we introduce a new two-stream transformer video classifier that extracts spatio-temporal information from content and from optical flow representing movement information. The proposed model identifies self-attention features across the joint optical-flow and temporal-frame domain and represents their relationships within the transformer encoder. The experimental results show that our proposed methodology provides excellent classification results on three well-known video datasets of human activities.
https://arxiv.org/abs/2601.14086
Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems such as robot and drone navigation. To address this challenge, recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther away than the glass surface itself. Consequently, in video motion scenarios, the salient reflected (or transmitted) objects on the glass surface move more slowly than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion-inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM), which integrates extracted spatial features and estimated optical flow maps, and the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for learning our network, we also propose a large-scale dataset comprising 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
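The motion-inconsistency cue can be reduced to a toy heuristic: compare the mean flow magnitude inside a candidate region against its surroundings, flagging regions that move noticeably slower. The ratio threshold is an assumption for illustration; the actual network learns this cue from fused spatial and flow features:

```python
import numpy as np

def motion_inconsistency(flow_mag, region_mask, ring_mask, min_ratio=0.7):
    """Flag a candidate region as possible glass if its mean optical-flow
    magnitude is clearly lower than that of its surroundings.
    flow_mag: (H, W) flow magnitudes; masks: boolean (H, W) arrays."""
    inside = flow_mag[region_mask].mean()    # apparent motion on the glass
    outside = flow_mag[ring_mask].mean()     # motion of the surroundings
    return inside < min_ratio * outside
```

The physical intuition is parallax: reflected/transmitted content lies optically farther away than the surrounding scene, so under camera motion it translates less in the image.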
https://arxiv.org/abs/2601.13715
Autonomous navigation for nano-scale unmanned aerial vehicles (nano-UAVs) is governed by extreme Size, Weight, and Power (SWaP) constraints (weight under 50 g and a sub-100 mW onboard processor), distinguishing it fundamentally from standard robotic paradigms. This review synthesizes the state of the art in sensing, computing, and control architectures designed specifically for these sub-100 mW computational envelopes. We critically analyse the transition from classical geometry-based methods to emerging "Edge AI" paradigms, including quantized deep neural networks deployed on ultra-low-power System-on-Chips (SoCs) and neuromorphic event-based control. Beyond algorithms, we evaluate the hardware-software co-design requisite for autonomy, covering advancements in dense optical flow, optimized Simultaneous Localization and Mapping (SLAM), and learning-based flight control. While significant progress has been observed in visual navigation and relative pose estimation, our analysis reveals persistent gaps in long-term endurance, robust obstacle avoidance in dynamic environments, and the "Sim-to-Real" transfer of reinforcement learning policies. This survey provides a roadmap for bridging these gaps, advocating for hybrid architectures that fuse lightweight classical control with data-driven perception to enable fully autonomous, agile nano-UAVs in GPS-denied environments.
https://arxiv.org/abs/2601.13252
Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be naturally adapted in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.
https://arxiv.org/abs/2601.12761
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
https://arxiv.org/abs/2601.10781
Underwater imaging is fundamentally challenging due to wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination. These effects impair standard cameras and make ground-truth motion nearly impossible to obtain. On the other hand, event cameras offer microsecond resolution and high dynamic range. Nonetheless, progress on investigating event cameras for underwater environments has been limited due to the lack of datasets that pair realistic underwater optics with accurate optical flow. To address this problem, we introduce the first synthetic underwater benchmark dataset for event-based optical flow derived from physically-based ray-traced RGBD sequences. Using a modern video-to-event pipeline applied to rendered underwater videos, we produce realistic event data streams with dense ground-truth flow, depth, and camera motion. Moreover, we benchmark state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy. Our dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms. The source code and dataset for this project are publicly available at this https URL.
https://arxiv.org/abs/2601.10054
Audio-visual semantic segmentation (AVSS) extends the audio-visual segmentation (AVS) task, requiring a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, initially providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the structure of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.
https://arxiv.org/abs/2601.08133
Facial optical flow supports a wide range of tasks in facial motion analysis. However, the lack of high-resolution facial optical flow datasets has hindered progress in this area. In this paper, we introduce Splatting Rasterization Flow (SRFlow), a high-resolution facial optical flow dataset, and Splatting Rasterization Guided FlowNet (SRFlowNet), a facial optical flow model with tailored regularization losses. These losses constrain flow predictions using masks and gradients computed via difference or Sobel operators. This effectively suppresses high-frequency noise and large-scale errors in texture-less or repetitive-pattern regions, enabling SRFlowNet to be the first model explicitly capable of capturing high-resolution skin motion guided by Gaussian splatting rasterization. Experiments show that training with the SRFlow dataset improves facial optical flow estimation across various optical flow models, reducing end-point error (EPE) by up to 42% (from 0.5081 to 0.2953). Furthermore, when coupled with the SRFlow dataset, SRFlowNet achieves up to a 48% improvement in F1-score (from 0.4733 to 0.6947) on a composite of three micro-expression datasets. These results demonstrate the value of advancing both facial optical flow estimation and micro-expression recognition.
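A Sobel-based gradient term of the kind used in such regularization losses might look as follows; this is only a sketch of the mask/gradient-weighting idea, not the paper's exact loss, and the weighting scheme is an assumption:

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude of a 2-D array via 3x3 Sobel filters
    (naive direct convolution with edge padding, for clarity)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    pad = np.pad(img, 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 3, j:j + 3]
            gx[i, j] = (kx * patch).sum()
            gy[i, j] = (ky * patch).sum()
    return np.hypot(gx, gy)

def smoothness_loss(flow_channel, weight):
    """Penalize flow gradients where `weight` is high, e.g. a mask of
    texture-less or repetitive-pattern regions prone to noisy predictions."""
    return float((weight * sobel_magnitude(flow_channel)).mean())
```

A spatially constant flow incurs zero penalty, while high-frequency jitter inside the masked regions is what the term suppresses.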
https://arxiv.org/abs/2601.06479
We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as a time-dependent Poly-Fourier curve for parameter-efficient motion encoding. We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods, while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.
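A time-dependent Poly-Fourier trajectory for a single coordinate can be evaluated as below; the coefficient layout and period handling are illustrative assumptions about what "Poly-Fourier" denotes (a polynomial trend plus a truncated Fourier series), not the paper's exact parameterization:

```python
import numpy as np

def poly_fourier(t, poly_coeffs, fourier_coeffs, period=1.0):
    """Evaluate x(t) = sum_k a_k t^k
                     + sum_m (b_m sin(2*pi*m*t/T) + c_m cos(2*pi*m*t/T)).
    poly_coeffs: [a_0, a_1, ...]; fourier_coeffs: [(b_1, c_1), (b_2, c_2), ...]."""
    t = np.asarray(t, dtype=float)
    x = sum(a * t ** k for k, a in enumerate(poly_coeffs))
    for m, (b, c) in enumerate(fourier_coeffs, start=1):
        w = 2.0 * np.pi * m * t / period
        x = x + b * np.sin(w) + c * np.cos(w)
    return x
```

The appeal of such a parameterization is compactness: a handful of coefficients per Gaussian encodes both drift (polynomial part) and periodic motion (Fourier part) over the whole clip, instead of storing per-frame positions.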
https://arxiv.org/abs/2601.05368