This study introduces a methodology for human action recognition that harnesses deep neural networks and adaptive fusion strategies across multiple modalities, including RGB, optical flow, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass the limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violent action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to advance action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.
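As a toy illustration of the adaptive gating idea, the sketch below fuses two modality feature vectors (say, RGB and optical flow) with a scalar sigmoid gate computed from their concatenation. The function names and the scalar-gate simplification are illustrative assumptions, not the paper's architecture, which learns gates inside a deep network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(rgb_feat, flow_feat, gate_w, gate_b):
    """Fuse two modality feature vectors with a scalar sigmoid gate.

    The gate g in (0, 1) is computed from the concatenated features and
    weights each modality's contribution: fused = g*rgb + (1 - g)*flow.
    """
    concat = rgb_feat + flow_feat  # list concatenation
    score = sum(w * x for w, x in zip(gate_w, concat)) + gate_b
    g = sigmoid(score)
    return [g * a + (1.0 - g) * b for a, b in zip(rgb_feat, flow_feat)]
```

With zero gate weights and bias, the modalities are averaged; a strongly positive gate score drives the fusion toward the first modality, which is the kind of selective integration the abstract refers to.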
https://arxiv.org/abs/2512.04943
Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at this https URL
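The Observer-Follower split can be caricatured as a scheduling loop: an expensive detector re-anchors the track at a low rate, while a cheap flow-based step propagates the position in between. The sketch below is a minimal scheduler with stub `detect` and `flow_step` callables; the 1-in-6 cadence and the function names are illustrative assumptions, not the paper's actual pipeline.

```python
def track(frames, detect, flow_step, detect_every=6):
    """Observer-Follower scheduling sketch: a slow, accurate detector
    re-anchors the track every `detect_every` frames; in between, a cheap
    flow-based step propagates the last known position."""
    positions = []
    pos = None
    for i, frame in enumerate(frames):
        if pos is None or i % detect_every == 0:
            pos = detect(frame)          # Observer: accurate, low frequency
        else:
            pos = flow_step(frame, pos)  # Follower: cheap, high frequency
        positions.append(pos)
    return positions
```

In the real system the two streams run concurrently on GPU and CPU; this serial loop only shows the cadence at which the detector corrects accumulated flow drift.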
https://arxiv.org/abs/2512.04883
Autonomous landing on sloped terrain poses significant challenges for small, lightweight spacecraft, such as rotorcraft and landers. These vehicles have limited processing capability and payload capacity, which makes advanced deep learning methods and heavy sensors impractical. Flying insects, such as bees, achieve remarkable landings with minimal neural and sensory resources, relying heavily on optical flow. By regulating flow divergence, the ratio of vertical velocity to height, they perform smooth landings in which velocity and height decay exponentially together. However, adapting this bio-inspired strategy for spacecraft landings on sloped terrain presents two key challenges: global flow-divergence estimates obscure terrain inclination, and the nonlinear nature of divergence-based control can lead to instability when using conventional controllers. This paper proposes a nonlinear control strategy that leverages two distinct local flow divergence estimates to regulate both thrust and attitude during vertical landings. The control law is formulated based on Incremental Nonlinear Dynamic Inversion to handle the nonlinear flow divergence. The thrust control ensures a smooth vertical descent by keeping a constant average of the local flow divergence estimates, while the attitude control aligns the vehicle with the inclined surface at touchdown by exploiting their difference. The approach is evaluated in numerical simulations using a simplified 2D spacecraft model across varying slopes and divergence setpoints. Results show that regulating the average divergence yields stable landings with exponential decay of velocity and height, and using the divergence difference enables effective alignment with inclined terrain. Overall, the method offers a robust, low-resource landing strategy that enhances the feasibility of autonomous planetary missions with small spacecraft.
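The exponential-decay property of divergence-regulated descent can be checked with a few lines of numerical integration: if the vertical velocity is commanded to v = D·h for a constant setpoint D, the height satisfies h' = -D·h, so h(t) = h0·exp(-D·t). The toy Euler simulation below illustrates only that kinematic behavior, not the paper's Incremental Nonlinear Dynamic Inversion controller or its attitude loop.

```python
import math

def simulate_landing(h0, divergence, dt=0.01, t_end=5.0):
    """Toy constant-divergence descent: commanding v = D * h makes the
    height obey h' = -D * h, so height and velocity decay exponentially
    together toward a smooth touchdown."""
    h, t, traj = h0, 0.0, []
    while t < t_end - 1e-12:
        v = divergence * h   # vertical speed commanded by the setpoint
        h -= v * dt          # explicit Euler step
        t += dt
        traj.append((t, h))
    return traj

traj = simulate_landing(h0=100.0, divergence=0.5)
t_final, h_final = traj[-1]
```

The simulated height stays close to the closed-form solution h0·exp(-D·t), which is the "velocity and height decay exponentially together" behavior the abstract describes.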
https://arxiv.org/abs/2512.04373
Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.
https://arxiv.org/abs/2512.02933
Panoramic video generation has attracted growing attention due to its applications in virtual reality and immersive media. However, existing methods lack explicit motion control and struggle to generate scenes with large and complex motions. We propose PanFlow, a novel approach that exploits the spherical nature of panoramas to decouple the highly dynamic camera rotation from the input optical flow condition, enabling more precise control over large and dynamic motions. We further introduce a spherical noise warping strategy to promote loop consistency in motion across panorama boundaries. To support effective training, we curate a large-scale, motion-rich panoramic video dataset with frame-level pose and flow annotations. We also showcase the effectiveness of our method in various applications, including motion transfer and video editing. Extensive experiments demonstrate that PanFlow significantly outperforms prior methods in motion fidelity, visual quality, and temporal coherence. Our code, dataset, and models are available at this https URL.
https://arxiv.org/abs/2512.00832
The point spread function (PSF) serves as a fundamental descriptor linking the real-world scene to the captured signal, manifesting as camera blur. Accurate PSF estimation is crucial for both optical characterization and computational vision, yet remains challenging due to the inherent ambiguity and the ill-posed nature of intensity-based deconvolution. We introduce CircleFlow, a high-fidelity PSF estimation framework that employs flow-guided edge localization for precise blur characterization. CircleFlow begins with a structured capture that encodes locally anisotropic and spatially varying PSFs by imaging a circle grid target, while leveraging the target's binary luminance prior to decouple image and kernel estimation. The latent sharp image is then reconstructed through subpixel alignment of an initialized binary structure guided by optical flow, whereas the PSF is modeled as an energy-constrained implicit neural representation. Both components are jointly optimized within a demosaicing-aware differentiable framework, ensuring physically consistent and robust PSF estimation enabled by accurate edge localization. Extensive experiments on simulated and real-world data demonstrate that CircleFlow achieves state-of-the-art accuracy and reliability, validating its effectiveness for practical PSF calibration.
https://arxiv.org/abs/2512.00796
Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws-objects float, accelerations drift, and collisions behave inconsistently-revealing a persistent gap between visual realism and physical realism. We propose $\texttt{NewtonRewards}$, the first physics-grounded post-training framework for video generation based on $\textit{verifiable rewards}$. Instead of relying on human or VLM feedback, $\texttt{NewtonRewards}$ extracts $\textit{measurable proxies}$ from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate $\texttt{NewtonRewards}$ on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, $\texttt{NewtonBench-60K}$. Across all primitives in visual and physics metrics, $\texttt{NewtonRewards}$ consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.
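The constant-acceleration reward can be sketched directly: take a per-frame velocity proxy (in the paper, derived from optical flow), finite-difference it into accelerations, and penalize their spread around the mean. The minimal 1D version below is an illustrative assumption about the reward's shape, not the paper's exact formulation.

```python
def kinematic_reward(velocities, dt=1.0):
    """Newtonian kinematic score sketch: finite-difference accelerations
    from a per-frame velocity proxy, then penalize squared deviation from
    their mean (i.e., from constant-acceleration dynamics)."""
    accels = [(v1 - v0) / dt for v0, v1 in zip(velocities, velocities[1:])]
    mean_a = sum(accels) / len(accels)
    penalty = sum((a - mean_a) ** 2 for a in accels) / len(accels)
    return -penalty  # higher (closer to 0) is more Newtonian
```

A free-fall-like velocity sequence (constant acceleration) scores near zero, while erratic motion is penalized; the paper pairs such a kinematic term with a mass-conservation reward to rule out degenerate solutions.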
https://arxiv.org/abs/2512.00425
Visual odometry techniques typically rely on feature extraction from a sequence of images and subsequent computation of optical flow. This point-to-point correspondence between two consecutive frames can be costly to compute and suffers from varying accuracy, which affects the odometry estimate's quality. Attempts have been made to bypass the difficulties originating from the correspondence problem by adopting line features and fusing other sensors (event camera, IMU) to improve performance, many of which still heavily rely on correspondence. If the camera observes a straight line as it moves, the image of the line sweeps a smooth surface in image-space time. It is a ruled surface and analyzing its shape gives information about odometry. Further, its estimation requires only differentially computed updates from point-to-line associations. Inspired by event cameras' propensity for edge detection, this research presents a novel algorithm to reconstruct 3D scenes and visual odometry from these ruled surfaces. By constraining the surfaces with the inertia measurements from an onboard IMU sensor, the dimensionality of the solution space is greatly reduced.
https://arxiv.org/abs/2512.00327
We propose a method to reconstruct dynamic fire in 3D from a limited set of camera views with a Gaussian-based spatiotemporal representation. Capturing and reconstructing fire and its dynamics is highly challenging due to its volatile nature, transparent quality, and multitude of high-frequency features. Despite these challenges, we aim to reconstruct fire from only three views, which consequently requires solving for under-constrained geometry. We solve this by separating the static background from the dynamic fire region by combining dense multi-view stereo images with monocular depth priors. The fire is initialized as a 3D flow field, obtained by fusing per-view dense optical flow projections. To capture the high frequency features of fire, each 3D Gaussian encodes a lifetime and linear velocity to match the dense optical flow. To ensure sub-frame temporal alignment across cameras we employ a custom hardware synchronization pattern -- allowing us to reconstruct fire with affordable commodity hardware. Our quantitative and qualitative validations across numerous reconstruction experiments demonstrate robust performance for diverse and challenging real fire scenarios.
https://arxiv.org/abs/2511.22459
Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach automating high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt, achieving state-of-the-art results without specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.
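The warp-then-correct step can be sketched on toy grids: colors are pulled along a backward flow from the previous frame, and any pixel whose flow source disagrees with the current grayscale is flagged for recolorization. Integer flow, a scalar tolerance, and `None` as the "needs recolorization" marker are simplifying assumptions; the paper uses RAFT flow and a diffusion model for the flagged regions.

```python
def warp_colors(prev_colors, flow, cur_gray, prev_gray, tol=10):
    """Propagate colors along backward optical flow, then correct
    inconsistencies: a pixel keeps the warped color only if the grayscale
    value it came from matches the current grayscale within `tol`;
    otherwise it is flagged for recolorization (None)."""
    h, w = len(cur_gray), len(cur_gray[0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]              # backward flow: current -> previous
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w and \
               abs(cur_gray[y][x] - prev_gray[sy][sx]) <= tol:
                out[y][x] = prev_colors[sy][sx]
    return out
```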
https://arxiv.org/abs/2511.22330
Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.
https://arxiv.org/abs/2511.22167
The MARWIN robot operates at the European XFEL to perform autonomous radiation monitoring in long, monotonous accelerator tunnels where conventional localization approaches struggle. Its current navigation concept combines lidar-based edge detection, wheel/lidar odometry with periodic QR-code referencing, and fuzzy control of wall distance, rotation, and longitudinal position. While robust in predefined sections, this design lacks flexibility for unknown geometries and obstacles. This paper explores deep visual stereo odometry (DVSO) with 3D-geometric constraints as a focused alternative. DVSO is purely vision-based, leveraging stereo disparity, optical flow, and self-supervised learning to jointly estimate depth and ego-motion without labeled data. For global consistency, DVSO can subsequently be fused with absolute references (e.g., landmarks) or other sensors. We provide a conceptual evaluation for accelerator tunnel environments, using the European XFEL as a case study. Expected benefits include reduced scale drift via stereo, low-cost sensing, and scalable data collection, while challenges remain in low-texture surfaces, lighting variability, computational load, and robustness under radiation. The paper defines a research agenda toward enabling MARWIN to navigate more autonomously in constrained, safety-critical infrastructures.
https://arxiv.org/abs/2512.00080
Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains one of the major challenges. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, which are grouped into three interaction pairs: (1) Global semantic map and global optical flow, (2) Local RGB image and local optical flow, and (3) Ego-vehicle speed and pedestrian's bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions within the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical flow-guided attention. Within the motion interaction pair, cross-modal attention is employed to model the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step. Furthermore, a Transformer-based temporal feature aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies are further conducted to investigate the contribution of different modules of ACIT.
https://arxiv.org/abs/2511.20020
Private lunar missions are faced with the challenge of robust autonomous navigation while operating under stringent constraints on mass, power, and computational resources. This work proposes a motion-field inversion framework that uses optical flow and rangefinder-based depth estimation as a lightweight CPU-based solution for egomotion estimation during lunar descent. We extend classical optical flow formulations by integrating them with depth modeling strategies tailored to the geometry for lunar/planetary approach, descent, and landing, specifically, planar and spherical terrain approximations parameterized by a laser rangefinder. Motion field inversion is performed through a least-squares framework, using sparse optical flow features extracted via the pyramidal Lucas-Kanade algorithm. We verify our approach using synthetically generated lunar images over the challenging terrain of the lunar south pole, using CPU budgets compatible with small lunar landers. The results demonstrate accurate velocity estimation from approach to landing, with sub-10% error for complex terrain and on the order of 1% for more typical terrain, as well as performances suitable for real-time applications. This framework shows promise for enabling robust, lightweight on-board navigation for small lunar missions.
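Motion-field inversion reduces to a small least-squares problem once a depth model is fixed. The sketch below assumes the simplest illustrative case: a fronto-parallel planar terrain at rangefinder depth Z, purely translational motion, and unit focal length, so each flow sample (x, y, u, v) contributes two linear equations in the unknown velocity (vx, vy, vz). The paper's actual planar and spherical terrain models add slope and rotation terms, but the normal-equations structure is the same.

```python
def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    n = 3
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def invert_motion_field(samples, Z):
    """Least-squares egomotion from sparse flow under a fronto-parallel
    plane at rangefinder depth Z (translation only, focal length = 1):
        u = (-vx + x*vz) / Z,   v = (-vy + y*vz) / Z
    Each sample is (x, y, u, v) in normalized image coordinates; the
    normal equations are accumulated and solved for (vx, vy, vz)."""
    AtA = [[0.0] * 3 for _ in range(3)]
    Atb = [0.0] * 3
    for x, y, u, v in samples:
        for row, obs in (([-1.0 / Z, 0.0, x / Z], u),
                         ([0.0, -1.0 / Z, y / Z], v)):
            for i in range(3):
                Atb[i] += row[i] * obs
                for j in range(3):
                    AtA[i][j] += row[i] * row[j]
    return solve3(AtA, Atb)
```

On noiseless synthetic flow the inversion recovers the generating velocity exactly, which is the sanity check behind the sub-10% real-terrain errors the abstract reports.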
https://arxiv.org/abs/2511.17720
Dynamic magnetic resonance imaging (dMRI) captures temporally-resolved anatomy but is often challenged by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstructions typically rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. In this work, we propose a novel implicit neural representation (INR) framework that jointly models both the dynamic image sequence and its underlying motion field. Specifically, one INR is employed to parameterize the spatiotemporal image content, while another INR represents the optical flow. The two are coupled via the optical flow equation, which serves as a physics-inspired regularization, in addition to a data consistency loss that enforces agreement with k-space measurements. This joint optimization enables simultaneous recovery of temporally coherent images and motion fields without requiring prior flow estimation. Experiments on dynamic cardiac MRI datasets demonstrate that the proposed method outperforms state-of-the-art motion-compensated and deep learning approaches, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity. These results highlight the potential of implicit joint modeling with flow-regularized constraints for advancing dMRI reconstruction.
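The coupling term is the brightness-constancy (optical flow) equation evaluated as a residual, Ix·u + Iy·v + It = 0. A minimal pointwise version over flattened gradient and flow arrays, shown only to make the regularizer concrete (the paper evaluates it on INR outputs with spatial and temporal derivatives), might look like:

```python
def flow_equation_residual(Ix, Iy, It, u, v):
    """Physics-inspired coupling loss: mean squared residual of the
    optical-flow (brightness-constancy) equation Ix*u + Iy*v + It = 0,
    evaluated pointwise over flattened image gradients and flow fields."""
    n = len(Ix)
    return sum((Ix[k] * u[k] + Iy[k] * v[k] + It[k]) ** 2 for k in range(n)) / n
```

A flow field consistent with the image derivatives drives the residual to zero; in the joint optimization this term is added to the k-space data-consistency loss.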
https://arxiv.org/abs/2511.16948
Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model improves the mean MAE error on buildings from 1.33 to 1.19 compared to the original EOGS model.
https://arxiv.org/abs/2511.16542
This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
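A single-level Horn-Schunck iteration is compact enough to write out in full: each sweep replaces the flow with a neighborhood average nudged toward the brightness-constancy constraint, with alpha² weighting the smoothness term. The pure-Python sketch below omits the multiresolution pyramid discussed in the paper and assumes the image gradients Ix, Iy, It are precomputed.

```python
def horn_schunck(Ix, Iy, It, alpha=1.0, iters=100):
    """Minimal Horn-Schunck solver on small 2D lists. Each Jacobi sweep
    sets the flow to its local average minus a correction along the image
    gradient, balancing smoothness (alpha^2) against brightness constancy."""
    h, w = len(Ix), len(Ix[0])
    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]

    def avg(f, y, x):
        nbrs = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
        vals = [f[a][b] for a, b in nbrs if 0 <= a < h and 0 <= b < w]
        return sum(vals) / len(vals)

    for _ in range(iters):
        nu = [[0.0] * w for _ in range(h)]
        nv = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                ub, vb = avg(u, y, x), avg(v, y, x)
                num = Ix[y][x] * ub + Iy[y][x] * vb + It[y][x]
                den = alpha ** 2 + Ix[y][x] ** 2 + Iy[y][x] ** 2
                nu[y][x] = ub - Ix[y][x] * num / den
                nv[y][x] = vb - Iy[y][x] * num / den
        u, v = nu, nv
    return u, v
```

For a uniform horizontal gradient with It = -Ix (an image shifting right by one pixel per frame), the iteration converges to u ≈ 1, v ≈ 0; the multiresolution variant in the paper applies the same update coarse-to-fine to handle larger displacements.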
https://arxiv.org/abs/2511.16535
Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints, called LAOF, a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10 percent. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1 percent of action labels.
https://arxiv.org/abs/2511.16407
Event cameras, by virtue of their working principle, directly encode motion within a scene. Many learning-based and model-based methods exist that estimate event-based optical flow, however the temporally dense yet spatially sparse nature of events poses significant challenges. To address these issues, contrast maximization (CM) is a prominent model-based optimization methodology that estimates the motion trajectories of events within an event volume by optimally warping them. Since its introduction, the CM framework has undergone a series of refinements by the computer vision community. Nonetheless, it remains a highly non-convex optimization problem. In this paper, we introduce a novel biologically-inspired hybrid CM method for event-based optical flow estimation that couples visual and inertial motion cues. Concretely, we propose the use of orientation maps, derived from camera 3D velocities, as priors to guide the CM process. The orientation maps provide directional guidance and constrain the space of estimated motion trajectories. We show that this orientation-guided formulation leads to improved robustness and convergence in event-based optical flow estimation. The evaluation of our approach on the MVSEC, DSEC, and ECD datasets yields superior accuracy scores over the state of the art.
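The contrast-maximization objective itself is easy to state in code: warp events along a candidate motion, accumulate them into an image, and score its sharpness by pixel-intensity variance. The sketch below uses a single global flow and unsigned event counts as simplifying assumptions; the paper optimizes per-volume trajectories and adds inertial orientation-map priors to constrain this non-convex search.

```python
def contrast(events, flow, shape):
    """Contrast-maximization objective: warp each event (x, y, t) back to
    t = 0 along a candidate flow (vx, vy), accumulate the warped events
    into an image, and return the pixel-intensity variance. The flow that
    best aligns the events yields the sharpest (highest-variance) image."""
    h, w = shape
    img = [[0] * w for _ in range(h)]
    vx, vy = flow
    for x, y, t in events:
        wx, wy = round(x - vx * t), round(y - vy * t)
        if 0 <= wx < w and 0 <= wy < h:
            img[wy][wx] += 1
    n = h * w
    mean = sum(map(sum, img)) / n
    return sum((p - mean) ** 2 for row in img for p in row) / n

# a point moving right at 1 px per unit time fires one event per step
events = [(float(t), 2.0, float(t)) for t in range(5)]
```

The correct flow stacks all five events onto one pixel and so scores higher than the zero-flow hypothesis, which smears them across the row.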
https://arxiv.org/abs/2511.12961
In this work, we propose an accurate and real-time optical flow and disparity estimation model that fuses pairwise input images in the proposed non-causal selective state space for dense perception tasks. We propose a non-causal Mamba block-based model that is fast and efficient and aptly manages the constraints present in real-time applications. Our proposed model reduces inference times while maintaining high accuracy and low GPU usage for optical flow and disparity map generation. The results, analysis, and validation in real-life scenarios show that our proposed model can be used for unified, real-time, and accurate 3D dense perception estimation tasks. The code, along with the models, can be found at this https URL
https://arxiv.org/abs/2511.12671