Magnetic resonance imaging (MRI) enables non-invasive, high-resolution analysis of muscle structures. However, automated segmentation remains limited by high computational costs, reliance on large training datasets, and reduced accuracy in segmenting smaller muscles. Convolutional neural network (CNN)-based methods, while powerful, often suffer from substantial computational overhead, limited generalizability, and poor interpretability across diverse populations. This study proposes a training-free segmentation approach based on keypoint tracking, which integrates keypoint selection with Lucas-Kanade optical flow. The proposed method achieves a mean Dice similarity coefficient (DSC) ranging from 0.6 to 0.7, depending on the keypoint selection strategy, performing comparably to state-of-the-art CNN-based models while substantially reducing computational demands and enhancing interpretability. This scalable framework presents a robust and explainable alternative for muscle segmentation in clinical and research applications.
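To make the keypoint-tracking idea concrete, here is a minimal sketch (not the authors' pipeline) that selects keypoints inside a muscle mask and propagates them to the next slice with pyramidal Lucas-Kanade optical flow in OpenCV; the function name, parameters, and the mask-based selection strategy are illustrative assumptions.

```python
import cv2
import numpy as np

def propagate_keypoints(prev_img, next_img, mask):
    """Track keypoints from a segmented slice/frame to the next one.

    prev_img, next_img: uint8 grayscale images (e.g., consecutive MRI slices/frames)
    mask: uint8 binary mask of the muscle on prev_img (255 inside, 0 outside)
    Returns the tracked point coordinates on next_img.
    """
    # keypoint selection: corners restricted to the current muscle mask
    pts = cv2.goodFeaturesToTrack(prev_img, maxCorners=200, qualityLevel=0.01,
                                  minDistance=5, mask=mask)

    # pyramidal Lucas-Kanade optical flow
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_img, next_img, pts, None,
        winSize=(21, 21), maxLevel=3,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

    good = status.ravel() == 1
    return nxt[good].reshape(-1, 2)  # e.g., fit a contour/convex hull to get the new mask
```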
https://arxiv.org/abs/2507.08690
This work presents PanMatch, a versatile foundation model for robust correspondence matching. Unlike previous methods that rely on task-specific architectures and domain-specific fine-tuning to support tasks like stereo matching, optical flow or feature matching, our key insight is that any two-frame correspondence matching task can be addressed within a 2D displacement estimation framework using the same model weights. Such a formulation eliminates the need for designing specialized unified architectures or task-specific ensemble models. Instead, it achieves multi-task integration by endowing displacement estimation algorithms with unprecedented generalization capabilities. To this end, we highlight the importance of a robust feature extractor applicable across multiple domains and tasks, and propose a feature transformation pipeline that leverages all-purpose features from Large Vision Models to endow matching baselines with zero-shot cross-view matching capabilities. Furthermore, we assemble a cross-domain dataset with nearly 1.8 million samples from the stereo matching, optical flow, and feature matching domains to pretrain PanMatch. We demonstrate the versatility of PanMatch across a wide range of domains and downstream tasks using the same model weights. Our model outperforms UniMatch and Flow-Anything on cross-task evaluations, and achieves comparable performance to most state-of-the-art task-specific algorithms on task-oriented benchmarks. Additionally, PanMatch presents unprecedented zero-shot performance in abnormal scenarios, such as rainy scenes and satellite imagery, where most existing robust algorithms fail to yield meaningful results.
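As a small illustration of the shared 2D displacement formulation (assumed for exposition, not taken from the paper), rectified-stereo disparity can be viewed as a flow field with a zero vertical component, so one displacement-estimation model can serve both tasks:

```python
import numpy as np

def disparity_to_flow(disparity):
    """View a rectified-stereo disparity map as a 2D displacement field.

    disparity: (H, W) array of pixel offsets from the left to the right image.
    Returns flow of shape (H, W, 2) usable by any optical-flow-style model.
    """
    u = -disparity                      # rectified stereo shifts purely horizontally
    v = np.zeros_like(disparity)        # no vertical displacement
    return np.stack([u, v], axis=-1)

def flow_to_disparity(flow):
    """Recover disparity from a generic 2D displacement prediction."""
    return -flow[..., 0]
```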
https://arxiv.org/abs/2507.08400
Integration of hyperspectral imaging into fluorescence-guided neurosurgery has the potential to improve surgical decision making by providing quantitative fluorescence measurements in real time. Quantitative fluorescence requires paired spectral data in fluorescence (blue light) and reflectance (white light) mode. Blue and white image acquisition needs to be performed sequentially in a potentially dynamic surgical environment. A key component of the fluorescence quantification process is therefore the ability to find dense cross-modal image correspondences between two hyperspectral images taken under these drastically different lighting conditions. We address this challenge with the introduction of X-RAFT, a Recurrent All-Pairs Field Transforms (RAFT) optical flow model modified for cross-modal inputs. We propose using distinct image encoders for each modality pair, and fine-tune these in a self-supervised manner using flow-cycle-consistency on our neurosurgical hyperspectral data. We show an error reduction of 36.6% across our evaluation metrics compared to a naive baseline, and a 27.83% reduction compared to an existing cross-modal optical flow method (CrossRAFT). Our code and models will be made publicly available after the review process.
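A minimal sketch of a flow-cycle-consistency objective of the kind described, assuming pixel-space flows and PyTorch tensors; occlusion handling and the paper's exact formulation are omitted:

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B, C, H, W) by a pixel-space flow (B, 2, H, W)."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()[None]            # (1, 2, H, W), (x, y)
    coords = base + flow                                          # sample locations
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,           # normalise to [-1, 1]
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def cycle_consistency_loss(flow_ab, flow_ba):
    """Self-supervised flow-cycle residual between two views/modalities.

    flow_ab, flow_ba: (B, 2, H, W). Chaining A->B with B->A (sampled at the
    locations reached by flow_ab) should return every pixel to its start.
    """
    flow_ba_at_a = warp(flow_ba, flow_ab)
    return (flow_ab + flow_ba_at_a).abs().mean()
```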
https://arxiv.org/abs/2507.07747
In this paper, we present a novel framework for extracting underlying crowd motion patterns and inferring crowd semantics using mmWave radar. First, our proposed signal processing pipeline combines optical flow estimation concepts from vision with novel statistical and morphological noise filtering to generate high-fidelity mmWave flow fields - compact 2D vector representations of crowd motion. We then introduce a novel approach that transforms these fields into directed geometric graphs, where edges capture dominant flow currents, vertices mark crowd splitting or merging, and flow distribution is quantified across edges. Finally, we show that by analyzing the local Jacobian and computing the corresponding curl and divergence, we can extract key crowd semantics for both structured and diffused crowds. We conduct 21 experiments on crowds of up to (and including) 20 people across 3 areas, using commodity mmWave radar. Our framework achieves high-fidelity graph reconstruction of the underlying flow structure, even for complex crowd patterns, demonstrating strong spatial alignment and precise quantitative characterization of flow split ratios. Finally, our curl and divergence analysis accurately infers key crowd semantics, e.g., abrupt turns, boundaries where flow directions shift, dispersions, and gatherings. Overall, these findings validate our framework, underscoring its potential for various crowd analytics applications.
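The curl/divergence analysis can be illustrated with a few lines of NumPy (an assumed discretization, not the authors' code): finite differences give the local Jacobian of the flow field, from which curl flags abrupt turning and divergence flags dispersion or gathering.

```python
import numpy as np

def flow_curl_divergence(u, v, spacing=1.0):
    """Curl and divergence of a 2D flow field sampled on a regular grid.

    u, v: (H, W) arrays, x- and y-components of the crowd flow field.
    Returns (curl, div), both (H, W). Large |curl| marks abrupt turns,
    positive div marks dispersion, negative div marks gathering.
    """
    du_dy, du_dx = np.gradient(u, spacing)   # np.gradient returns d/drow, d/dcol
    dv_dy, dv_dx = np.gradient(v, spacing)
    curl = dv_dx - du_dy
    div = du_dx + dv_dy
    return curl, div
```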
https://arxiv.org/abs/2507.07331
Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. Our proposed pipeline, AnthroTAP, addresses this with an automated process for generating pseudo-labeled training data that leverages the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray-casting, and filter out unreliable tracks based on optical flow consistency. A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day on 4 GPUs, compared to the 256 GPUs used by recent state-of-the-art methods.
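The pseudo-trajectory generation step boils down to perspective projection of mesh vertices; a hedged sketch assuming pinhole intrinsics K and camera-frame vertices (both illustrative inputs, not the paper's code) is:

```python
import numpy as np

def project_vertices(vertices, K):
    """Project 3D mesh vertices (e.g., SMPL output) onto the image plane.

    vertices: (N, 3) points in the camera coordinate frame (z > 0 in front).
    K:        (3, 3) pinhole intrinsic matrix.
    Returns (N, 2) pixel coordinates; repeating this per frame yields one
    pseudo-trajectory per vertex (before occlusion and consistency filtering).
    """
    proj = vertices @ K.T                # homogeneous image coordinates
    return proj[:, :2] / proj[:, 2:3]    # perspective divide
```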
https://arxiv.org/abs/2507.06233
Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization and progressively enhance full-face coherence, effectively mitigating temporal jittering. (ii) The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction and improved lip synchronization using Wav2Lip results, thereby preserving identity consistency. (iii) A refined blinking strategy to model natural eye movements, with smoother and more realistic blinking behaviors.
https://arxiv.org/abs/2507.05092
Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) to this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate these issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves a 20% improvement in FID on the most challenging datasets over the recent SOTA among image-based diffusion models. Meanwhile, thanks to this rich temporal information, our method achieves strong performance while having 3x fewer parameters. This parameter reduction results in a 2.3x speedup. By incorporating optical flow guidance, our method requires 9,000x less training data and has over 20x fewer parameters than video-based diffusion models. Codes and results are available at our project page: this https URL.
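For intuition about the Brownian-bridge prior pinned at the two input frames, here is a minimal sketch of its marginal at an intermediate time n (a generic bridge sample, not TLB-VFI's training or sampling code):

```python
import numpy as np

def brownian_bridge_sample(x0, x1, n, sigma=1.0, rng=None):
    """Sample from a Brownian bridge pinned at x0 (n=0) and x1 (n=1).

    x0, x1: arrays of identical shape (e.g., latents of frames I_0 and I_1).
    n:      time in (0, 1) at which to sample the intermediate latent.
    The marginal at time n has mean (1-n)*x0 + n*x1 and variance
    sigma^2 * n * (1-n), so uncertainty peaks midway between the endpoints.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - n) * x0 + n * x1
    std = sigma * np.sqrt(n * (1.0 - n))
    return mean + std * rng.standard_normal(x0.shape)
```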
https://arxiv.org/abs/2507.04984
Particle Image Velocimetry (PIV) is fundamental to fluid dynamics, yet deep learning applications face significant hurdles. A critical gap exists: the lack of comprehensive evaluation of how diverse optical flow models perform specifically on PIV data, largely due to limitations in available datasets and the absence of a standardized benchmark. This prevents fair comparison and hinders progress. To address this, our primary contribution is a novel, large-scale synthetic PIV benchmark dataset generated from diverse CFD simulations (JHTDB and Blasius). It features unprecedented variety in particle densities, flow velocities, and continuous motion, enabling, for the first time, a standardized and rigorous evaluation of various optical flow and PIV algorithms. Complementing this, we propose Multi Cost Volume PIV (MCFormer), a new deep network architecture leveraging multi-frame temporal information and multiple cost volumes, specifically designed for PIV's sparse nature. Our comprehensive benchmark evaluation, the first of its kind, reveals significant performance variations among adapted optical flow models and demonstrates that MCFormer significantly outperforms existing methods, achieving the lowest overall normalized endpoint error (NEPE). This work provides both a foundational benchmark resource essential for future PIV research and a state-of-the-art method tailored for PIV challenges. We make our benchmark dataset and code publicly available to foster future research in this area.
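For reference, endpoint error and one plausible normalized variant can be computed as below; the exact NEPE normalization used in the benchmark is an assumption here, not taken from the paper:

```python
import numpy as np

def epe_metrics(flow_pred, flow_gt, eps=1e-8):
    """Mean endpoint error (EPE) and a normalized variant for PIV-style evaluation.

    flow_pred, flow_gt: (H, W, 2) displacement fields in pixels.
    The normalized value divides each pixel's error by the ground-truth
    magnitude, which emphasizes errors in slow-flow regions.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)   # per-pixel EPE
    mag = np.linalg.norm(flow_gt, axis=-1)
    return err.mean(), (err / (mag + eps)).mean()
```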
https://arxiv.org/abs/2507.04750
The temporal interpolation task for 4D medical imaging plays a crucial role in the clinical practice of respiratory motion modeling. Following a simplified linear-motion hypothesis, existing approaches adopt optical flow-based models to interpolate intermediate frames. However, realistic respiratory motion is nonlinear and quasi-periodic, with specific frequencies. Motivated by this property, we address the temporal interpolation task from the frequency perspective and propose a Fourier basis-guided diffusion model, termed FB-Diff. Specifically, given the regular motion discipline of respiration, physiological motion priors are introduced to describe general characteristics of temporal data distributions. A Fourier motion operator is then carefully devised to extract Fourier bases by incorporating physiological motion priors and case-specific spectral information in the feature space of a variational autoencoder. Well-learned Fourier bases can better simulate respiratory motions with motion patterns of specific frequencies. Conditioned on the starting and ending frames, the diffusion model further leverages the learned Fourier bases via a basis interaction operator, which promotes the temporal interpolation task in a generative manner. Extensive results demonstrate that FB-Diff achieves state-of-the-art (SOTA) perceptual performance with better temporal consistency while maintaining promising reconstruction metrics. Codes are available.
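A toy version of the Fourier-basis idea, assuming a generic temporal feature sequence rather than the paper's VAE feature space, keeps only the dominant frequencies of a quasi-periodic signal:

```python
import numpy as np

def dominant_fourier_bases(signal, k=3):
    """Keep the k strongest temporal frequencies of a quasi-periodic signal.

    signal: (T, D) temporal feature sequence (e.g., per-voxel or latent features).
    Returns the selected frequency indices and the rank-k reconstruction,
    illustrating how a few Fourier bases can summarise respiratory motion.
    """
    spec = np.fft.rfft(signal, axis=0)                  # (T//2 + 1, D)
    power = np.abs(spec).sum(axis=1)
    power[0] = 0.0                                      # ignore the DC term when ranking
    top = np.argsort(power)[-k:]                        # dominant frequencies
    kept = np.zeros_like(spec)
    kept[0] = spec[0]                                   # keep the mean
    kept[top] = spec[top]
    return top, np.fft.irfft(kept, n=signal.shape[0], axis=0)
```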
https://arxiv.org/abs/2507.04547
Change detection typically involves identifying regions that change between bitemporal images taken at the same location. Besides significant changes, slow changes in bitemporal images are also important in real-life scenarios. For instance, weak changes often serve as precursors to major hazards in scenarios like slopes, dams, and tailings ponds. Therefore, designing a change detection network that simultaneously detects slow and fast changes presents a novel challenge. In this paper, to address this challenge, we propose a change detection network named Flow-CDNet, consisting of two branches: an optical flow branch and a binary change detection branch. The first branch utilizes a pyramid structure to extract displacement changes at multiple scales. The second combines a ResNet-based network with the optical flow branch's output to generate fast-change outputs. Subsequently, to supervise and evaluate this new change detection framework, we build a change detection dataset named Flow-Change, design a loss function combining binary Tversky loss and L2-norm loss, and introduce a new evaluation metric called FEPE. Quantitative experiments conducted on the Flow-Change dataset demonstrate that our approach outperforms existing methods. Furthermore, ablation experiments verify that the two branches promote each other and enhance detection performance.
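A hedged sketch of the combined objective, assuming a binary change-probability map and a slow-change displacement field as inputs (the paper's exact weighting and formulation may differ):

```python
import torch

def tversky_l2_loss(prob, target, flow_pred, flow_gt,
                    alpha=0.3, beta=0.7, lam=1.0, eps=1e-6):
    """Binary Tversky loss on the change mask plus an L2 term on the flow.

    prob, target:       (B, 1, H, W) predicted probabilities and binary labels.
    flow_pred, flow_gt: (B, 2, H, W) slow-change displacement fields.
    alpha/beta weight false positives/negatives; alpha = beta = 0.5 recovers Dice.
    """
    p, t = prob.flatten(1), target.flatten(1)
    tp = (p * t).sum(dim=1)
    fp = (p * (1 - t)).sum(dim=1)
    fn = ((1 - p) * t).sum(dim=1)
    tversky = 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

    l2 = torch.mean((flow_pred - flow_gt) ** 2)
    return tversky.mean() + lam * l2
```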
https://arxiv.org/abs/2507.02307
3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment, incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible length sequence, reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency. Code is available at: this https URL.
https://arxiv.org/abs/2507.00802
For robots to move in the real world, they must first correctly understand the state of their own bodies and the tools they hold. In this research, we propose DIJE, an algorithm to estimate the image Jacobian for every pixel. It is based on an optical flow calculation and a simplified Kalman filter that can be run efficiently on the whole image in real time. It relies on neither markers nor knowledge of the robot's structure. We use DIJE in a self-recognition process that can robustly distinguish between movement by the robot and by external entities, even when the motions overlap. We also propose a visual servoing controller based on DIJE, which can learn to control the robot's body to conduct reaching movements or bimanual tool-tip control. The proposed algorithms were implemented on a physical musculoskeletal robot and their performance was verified. We believe that such global estimation of the visuomotor policy has the potential to be extended into a more general framework for manipulation.
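A minimal sketch of a per-pixel image-Jacobian update in the spirit described, assuming the measured flow is approximately J times the joint velocity dq over the frame interval; shapes, noise parameters, and the exact filter form are illustrative assumptions, not the DIJE implementation:

```python
import numpy as np

def jacobian_kalman_update(J, P, flow, dq, r=1e-2):
    """One simplified Kalman update of the per-pixel image Jacobian.

    J:    (H, W, 2, N)    current Jacobian estimate (flow ~= J @ dq)
    P:    (H, W, 2, N, N) per-pixel, per-component state covariance
    flow: (H, W, 2)       measured optical flow for this frame pair
    dq:   (N,)            joint-velocity (or command) vector over the interval
    r:    scalar measurement-noise variance
    """
    pred = J @ dq                                  # flow predicted by current Jacobian
    innov = flow - pred                            # innovation

    Pdq = P @ dq                                   # (H, W, 2, N)
    S = np.einsum('...n,n->...', Pdq, dq) + r      # innovation variance
    K = Pdq / S[..., None]                         # Kalman gain

    J_new = J + K * innov[..., None]
    P_new = P - np.einsum('...i,...j->...ij', K, Pdq)
    return J_new, P_new
```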
https://arxiv.org/abs/2507.00446
Computer vision techniques have the potential to improve the diagnostic performance of colonoscopy, but the lack of 3D colonoscopy datasets for training and validation hinders their development. This paper introduces C3VDv2, the second version (v2) of the high-definition Colonoscopy 3D Video Dataset, featuring enhanced realism designed to facilitate the quantitative evaluation of 3D colon reconstruction algorithms. 192 video sequences were captured by imaging 60 unique, high-fidelity silicone colon phantom segments. Ground truth depth, surface normals, optical flow, occlusion, six-degree-of-freedom pose, coverage maps, and 3D models are provided for 169 colonoscopy videos. Eight simulated screening colonoscopy videos acquired by a gastroenterologist are provided with ground truth poses. The dataset includes 15 videos featuring colon deformations for qualitative assessment. C3VDv2 emulates diverse and challenging scenarios for 3D reconstruction algorithms, including fecal debris, mucous pools, blood, debris obscuring the colonoscope lens, en-face views, and fast camera motion. The enhanced realism of C3VDv2 will allow for more robust and representative development and evaluation of 3D reconstruction algorithms.
https://arxiv.org/abs/2506.24074
Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation. The code is publicly available at: this https URL.
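To see why polar regions are hard, the horizontal stretch of an equirectangular image grows as 1/cos(latitude); this small helper (an illustration, not part of PriOr-Flow) maps image rows to latitude and distortion factor:

```python
import numpy as np

def erp_row_distortion(height):
    """Horizontal stretch factor of each row in an equirectangular (ERP) image.

    Row r maps to latitude lat in (-pi/2, pi/2); a pixel there spans a
    horizontal arc proportional to cos(lat), so the ERP image over-samples
    (stretches) the scene by 1 / cos(lat), diverging toward the poles.
    """
    rows = np.arange(height) + 0.5            # half-pixel offset avoids cos(lat) == 0
    lat = np.pi / 2 - rows / height * np.pi
    return lat, 1.0 / np.cos(lat)
```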
https://arxiv.org/abs/2506.23897
Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at this https URL.
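Some back-of-the-envelope arithmetic (illustrative figures, not MEMFOF's actual layers) shows why full all-pairs correlation volumes dominate memory at 1080p:

```python
# Memory of a RAFT-style all-pairs correlation volume at 1080p.
# Features are typically computed at 1/8 resolution; the full volume stores one
# float per pair of feature-map positions (single level, one frame pair).
H, W = 1080 // 8, 1920 // 8              # 135 x 240 feature grid
pairs = (H * W) ** 2                     # all-pairs correlation entries
bytes_fp32 = pairs * 4
print(f"{bytes_fp32 / 1024**3:.1f} GiB")  # ~3.9 GiB before any reduction
```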
https://arxiv.org/abs/2506.23151
We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces the cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and little reliance on custom designs. Compared with existing methods, WAFT ranks 1st on the Spring and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, and is up to 4.1x faster than methods with similar performance. Code and model weights are available at this https URL.
https://arxiv.org/abs/2506.21526
Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods rely only on appearance constraints for optimizing both 3DGS and camera poses. However, in endoscopic scenarios, photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion from breathing degrade the performance of SLAM systems. To address these issues, we additionally introduce an optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion. Furthermore, we propose a depth regularisation strategy to mitigate the problem of photometric inconsistencies and ensure the validity of 3DGS depth rendering in endoscopic scenes. In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to keyframes with suboptimal rendering quality, achieving better rendering results. Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset demonstrate that our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, exhibiting high performance in both static and dynamic surgical scenes. The source code will be publicly available upon paper acceptance.
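As a sketch of the kind of geometric constraint an optical-flow loss can supervise, the flow induced by known depth and relative camera motion for a static scene can be computed as below; K, R, t and the camera-centric conventions are assumptions, and this is not the paper's implementation:

```python
import numpy as np

def induced_flow(depth, K, R, t):
    """Flow induced by camera motion (R, t) for a static scene with known depth.

    depth: (H, W) depth map of frame 1
    K:     (3, 3) camera intrinsics
    R, t:  rotation (3, 3) and translation (3,) of frame 2 w.r.t. frame 1
    Returns the 2D displacement field (H, W, 2) from frame 1 to frame 2,
    which can be compared against an observed/estimated optical flow.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)

    rays = pix @ np.linalg.inv(K).T               # back-project pixels to rays
    pts = rays * depth[..., None]                 # 3D points in frame-1 camera
    pts2 = pts @ R.T + t                          # points in frame-2 camera
    proj = pts2 @ K.T
    uv2 = proj[..., :2] / proj[..., 2:3]          # re-project into frame 2

    return uv2 - np.stack([u, v], axis=-1)
```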
https://arxiv.org/abs/2506.21420
Understanding human actions in videos requires more than raw pixel analysis; it relies on high-level semantic reasoning and effective integration of multimodal features. We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames. At test time, hallucination streams infer missing cues, enriching feature representations without increasing computational overhead. To focus on action-relevant regions beyond raw pixels, we introduce two novel domain-specific descriptors. Object Detection Features (ODF) aggregate outputs from multiple object detectors to capture contextual cues, while Saliency Detection Features (SDF) highlight spatial and intensity patterns crucial for action recognition. Our framework seamlessly integrates these descriptors with auxiliary modalities such as optical flow, Improved Dense Trajectories, skeleton data, and audio cues. It remains compatible with state-of-the-art architectures, including I3D, AssembleNet, Video Transformer Network, FASTER, and recent models like VideoMAE V2 and InternVideo2. To handle uncertainty in auxiliary features, we incorporate aleatoric uncertainty modeling in the hallucination step and introduce a robust loss function to mitigate feature noise. Our multimodal self-supervised action recognition framework achieves state-of-the-art performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and Something-Something V2, demonstrating its effectiveness in capturing fine-grained action dynamics.
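One common way to model aleatoric uncertainty for a regression-style hallucination target is a heteroscedastic loss in the style of Kendall and Gal; this generic sketch is an assumption about the form, not the paper's exact robust loss:

```python
import torch

def aleatoric_l1_loss(pred, target, log_var):
    """Heteroscedastic L1 regression loss.

    pred, target: hallucinated vs. ground-truth auxiliary features.
    log_var:      predicted per-element log-variance; elements the model deems
                  noisy are down-weighted, at the cost of a log-variance penalty.
    """
    precision = torch.exp(-log_var)
    return (precision * (pred - target).abs() + log_var).mean()
```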
https://arxiv.org/abs/2506.20342
Tissue deformation recovery based on stereo endoscopic images is crucial for tool-tissue interaction analysis and benefits surgical navigation and autonomous soft tissue manipulation. Previous research suffers from problems arising from camera motion, occlusion, large tissue deformation, the lack of tissue-specific biomechanical priors, and reliance on offline processing. Unlike previous studies, where tissue geometry and deformation are represented by 3D points and displacements, the proposed method models tissue geometry as a 3D point and derivative map and tissue deformation as a 3D displacement and local deformation map. For a single surface point, 6 parameters are used to describe its rigid motion and 3 parameters its local deformation. The method is formulated under a camera-centric setting, where all motions are regarded as scene motion with respect to the camera. Inter-frame alignment is realized by optimizing the inter-frame deformation, making it unnecessary to estimate camera pose. The concept of the canonical map is introduced to optimize tissue geometry and deformation in an online manner. Quantitative and qualitative experiments were conducted using in vivo and ex vivo laparoscopic datasets. With depth and optical flow as inputs, the method stably models tissue geometry and deformation even when the tissue is partially occluded or moving outside the field of view. Results show that the 3D reconstruction accuracy in the non-occluded and occluded areas reaches 0.37$\pm$0.27 mm and 0.39$\pm$0.21 mm in terms of surface distance, respectively. The method can also estimate surface strain distribution during various manipulations as an extra modality for mechanics-based analysis.
https://arxiv.org/abs/2506.19388
Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical flow-based visual SLAM approaches often require significant computational resources. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes the fast Fourier transform (FFT) to enhance algorithm efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it with a security robot, fusing it with a global positioning module (GNSS-RTK) and global Bundle Adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark conditions. The code and datasets are available at this https URL.
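As a rough illustration of how an FFT can replace quadratic self-attention (in the spirit of FNet, and not necessarily the attention used in FMF-SLAM), a Fourier token-mixing layer can be as small as:

```python
import torch

class FourierMixing(torch.nn.Module):
    """FNet-style token mixing: a 2D FFT over tokens and channels, real part kept.

    A generic stand-in for quadratic self-attention with O(N log N) mixing cost;
    shown only to illustrate the efficiency argument for Fourier-based attention.
    """
    def forward(self, x):                     # x: (B, N_tokens, C), real-valued
        return torch.fft.fft2(x, dim=(-2, -1)).real
```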
https://arxiv.org/abs/2506.18204