Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) to further improve camera controllability. Directly borrowing existing ReFL approaches, however, faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward quantification. Specifically, the video latents, together with the camera pose, are decoded into 3D Gaussians. In this process, the camera pose not only acts as an input but also serves as a projection parameter. Misalignment between the video latents and the camera pose causes geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between rendered novel views and ground-truth ones as the reward. To accommodate the stochastic nature of generation, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: \href{this https URL}{CamPilot Page}.
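The visibility-masked pixel reward described above can be sketched as follows; `masked_pixel_reward`, the toy frames, and the mask are illustrative stand-ins, not the paper's implementation (which renders the novel views from the decoded 3D Gaussians):

```python
import numpy as np

def masked_pixel_reward(rendered, ground_truth, visibility):
    """Negative masked MSE between a rendered novel view and its
    ground-truth frame; `visibility` is 1 where geometric warping
    marks a pixel as deterministic, 0 elsewhere."""
    diff = (rendered - ground_truth) ** 2
    masked = diff * visibility[..., None]          # broadcast over RGB
    denom = max(visibility.sum(), 1) * rendered.shape[-1]
    return -float(masked.sum() / denom)

# Toy 4x4 RGB frames: reward is 0 when all *visible* pixels match,
# even if occluded (stochastic) pixels differ.
gt = np.random.rand(4, 4, 3)
vis = np.zeros((4, 4)); vis[:2] = 1.0             # only top half is deterministic
rendered = gt.copy()
rendered[2:] += 0.5                               # corrupt only masked-out pixels
assert masked_pixel_reward(rendered, gt, vis) == 0.0
```

The visibility term keeps the reward from penalising regions the model is free to hallucinate.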
https://arxiv.org/abs/2601.16214
Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: this https URL
https://arxiv.org/abs/2601.16212
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: this http URL
https://arxiv.org/abs/2601.16207
Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, hindering application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results on zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at this https URL.
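The zero-padding issue can be illustrated with wrap-around padding along the longitude axis, so the left and right ERP borders see each other instead of zeros; `circular_pad_width` is a hypothetical helper, not the paper's Circular Latent Encoding itself:

```python
import numpy as np

def circular_pad_width(x, pad):
    """Pad an equirectangular feature map with wrap-around columns so
    the left/right ERP borders are continuous. Standard zero padding
    would insert zeros here, producing a visible seam after decoding."""
    left = x[:, -pad:]    # columns copied from the right edge
    right = x[:, :pad]    # columns copied from the left edge
    return np.concatenate([left, x, right], axis=1)

img = np.arange(12).reshape(3, 4)
padded = circular_pad_width(img, 1)
assert padded.shape == (3, 6)
assert (padded[:, 0] == img[:, -1]).all()   # wrapped from the right edge
assert (padded[:, -1] == img[:, 0]).all()   # wrapped from the left edge
```

In a real encoder the same effect is achieved by switching the convolution's padding mode to circular along the width dimension.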
https://arxiv.org/abs/2601.16192
Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their restrictive setup, long runtime, or limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dub "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs such as a monocular video, a text description, or even a 3D mesh paired with a text prompt describing its animation. Moreover, compared to previous approaches, our method is fast and produces results that are rig-free and topology-consistent, enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performance on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
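The idea of expressing an animation as per-frame deformations of one fixed reference shape can be sketched minimally; the names and toy shapes below are illustrative assumptions, not ActionMesh's actual autoencoder:

```python
import numpy as np

def animate_reference(ref_vertices, per_frame_offsets):
    """Apply per-frame vertex offsets to a fixed reference mesh.
    Because every frame reuses the reference topology, the result is a
    rig-free, topology-consistent animation."""
    return ref_vertices[None, :, :] + per_frame_offsets  # (T, V, 3)

ref = np.zeros((5, 3))                                   # 5-vertex reference shape
offsets = np.stack([np.full((5, 3), t * 0.1) for t in range(4)])
frames = animate_reference(ref, offsets)
assert frames.shape == (4, 5, 3)
assert np.allclose(frames[3], 0.3)
```

Keeping the connectivity fixed is what makes downstream texturing and retargeting straightforward: any per-vertex attribute painted on the reference carries over to every frame.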
https://arxiv.org/abs/2601.16148
Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83%p with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
https://arxiv.org/abs/2601.16046
We study the problem of collision-free humanoid traversal in cluttered indoor scenes, such as hurdling over objects scattered on the floor, crouching under low-hanging obstacles, or squeezing through narrow passages. To achieve this goal, the humanoid needs to map its perception of surrounding obstacles with diverse spatial layouts and geometries to the corresponding traversal skills. However, the lack of an effective representation that captures humanoid-obstacle relationships during collision avoidance makes directly learning such mappings difficult. We therefore propose Humanoid Potential Field (HumanoidPF), which encodes these relationships as collision-free motion directions, significantly facilitating RL-based traversal skill learning. We also find that HumanoidPF exhibits a surprisingly negligible sim-to-real gap as a perceptual representation. To further enable generalizable traversal skills across diverse and challenging cluttered indoor scenes, we also propose a hybrid scene generation method incorporating crops of realistic 3D indoor scenes and procedurally synthesized obstacles. We successfully transfer our policy to the real world and develop a teleoperation system where users can command the humanoid to traverse cluttered indoor scenes with a single click. Extensive experiments are conducted in both simulation and the real world to validate the effectiveness of our method. Demos and code can be found on our website: this https URL.
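A toy repulsive potential field that yields collision-free motion directions, a much-simplified stand-in for what HumanoidPF encodes, might look like:

```python
import numpy as np

def potential_field_direction(agent_pos, obstacle_points, influence=1.0):
    """Unit direction pushing the agent away from nearby obstacle
    points, weighted by inverse distance. Obstacles beyond the
    influence radius contribute nothing."""
    repulse = np.zeros(3)
    for p in obstacle_points:
        d = agent_pos - p
        dist = np.linalg.norm(d)
        if 1e-6 < dist < influence:
            repulse += d / dist * (1.0 / dist - 1.0 / influence)
    n = np.linalg.norm(repulse)
    return repulse / n if n > 1e-9 else repulse

agent = np.array([0.0, 0.0, 0.0])
obstacle = np.array([[0.5, 0.0, 0.0]])           # obstacle to the agent's +x side
direction = potential_field_direction(agent, obstacle)
assert direction[0] < 0                          # pushed away, toward -x
assert np.isclose(np.linalg.norm(direction), 1.0)
```

Feeding such a direction field to an RL policy gives it an explicit, geometry-aware hint about where collision-free motion lies, rather than forcing it to infer that from raw perception.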
https://arxiv.org/abs/2601.16035
Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically-enhanced image-based rendering module for predicting their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
https://arxiv.org/abs/2601.15951
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
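The triangulated 3D ground truth mentioned above is conventionally obtained with linear (DLT) triangulation from calibrated multi-view 2D keypoints; the sketch below uses toy cameras and is not the paper's calibration or optimization:

```python
import numpy as np

def project(P, X):
    """Pinhole projection of a 3D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one 3D joint from its 2D
    observations in several calibrated views: stack the cross-product
    constraints and take the SVD null vector."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]

# Two toy cameras: identity pose, and one shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 2.0])
X_rec = triangulate_point([P1, P2], [project(P1, X_true), project(P2, X_true)])
assert np.allclose(X_rec, X_true, atol=1e-6)
```

With noisy 2D detections, the DLT estimate would typically seed a constrained nonlinear refinement, as in the pipeline's final 3D optimization step.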
https://arxiv.org/abs/2601.15918
Multi-modal scene reconstruction integrating RGB and thermal infrared data is essential for robust environmental perception across diverse lighting and weather conditions. However, extending 3D Gaussian Splatting (3DGS) to multi-spectral scenarios remains challenging. Current approaches often struggle to fully leverage the complementary information of multi-modal data, typically relying on mechanisms that either tend to neglect cross-modal correlations or leverage shared representations that fail to adaptively handle the complex structural correlations and physical discrepancies between spectrums. To address these limitations, we propose ThermoSplat, a novel framework that enables deep spectral-aware reconstruction through active feature modulation and adaptive geometry decoupling. First, we introduce a Cross-Modal FiLM Modulation mechanism that dynamically conditions shared latent features on thermal structural priors, effectively guiding visible texture synthesis with reliable cross-modal geometric cues. Second, to accommodate modality-specific geometric inconsistencies, we propose a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets and executes an independent rasterization pass for the thermal branch. Additionally, a hybrid rendering pipeline is employed to integrate explicit Spherical Harmonics with implicit neural decoding, ensuring both semantic consistency and high-frequency detail preservation. Extensive experiments on the RGBT-Scenes dataset demonstrate that ThermoSplat achieves state-of-the-art rendering quality across both visible and thermal spectrums.
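FiLM conditioning itself is simple to sketch; here the per-channel scale and shift that ThermoSplat would predict from thermal structural priors are supplied directly for illustration:

```python
import numpy as np

def film_modulate(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift every channel of
    the shared latent features with condition-derived parameters
    (in ThermoSplat, predicted from the thermal branch)."""
    return gamma[None, :] * features + beta[None, :]

feats = np.ones((4, 3))                 # 4 spatial positions, 3 channels
gamma = np.array([2.0, 0.5, 1.0])       # per-channel scale (from the condition)
beta = np.array([0.0, 1.0, -1.0])       # per-channel shift (from the condition)
out = film_modulate(feats, gamma, beta)
assert np.allclose(out[0], [2.0, 1.5, 0.0])
```

Because gamma and beta are functions of the conditioning signal, the modulation actively re-weights shared features per spectrum instead of relying on a static shared representation.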
https://arxiv.org/abs/2601.15897
Neuron segmentation in electron microscopy (EM) aims to reconstruct the complete neuronal connectome; however, current deep learning-based methods are limited by their reliance on large-scale training data and extensive, time-consuming manual annotations. Traditional methods augment the training set through geometric and photometric transformations; however, the generated samples remain highly correlated with the original images and lack structural diversity. To address this limitation, we propose a diffusion-based data augmentation framework capable of generating diverse and structurally plausible image-label pairs for neuron segmentation. Specifically, the framework employs a resolution-aware conditional diffusion model with multi-scale conditioning and EM resolution priors to enable voxel-level image synthesis from 3D masks. It further incorporates a biology-guided mask remodeling module that produces augmented masks with enhanced structural realism. Together, these components effectively enrich the training set and improve segmentation performance. On the AC3 and AC4 datasets under low-annotation regimes, our method improves the ARAND metric by 32.1% and 30.7%, respectively, when combined with two different post-processing methods. Our code is available at this https URL.
https://arxiv.org/abs/2601.15779
This paper presents FeTal-SAM, a novel adaptation of the Segment Anything Model (SAM) tailored for fetal brain MRI segmentation. Traditional deep learning methods often require large annotated datasets for a fixed set of labels, making them inflexible when clinical or research needs change. By integrating atlas-based prompts and foundation-model principles, FeTal-SAM addresses two key limitations in fetal brain MRI segmentation: (1) the need to retrain models for varying label definitions, and (2) the lack of insight into whether segmentations are driven by genuine image contrast or by learned spatial priors. We leverage multi-atlas registration to generate spatially aligned label templates that serve as dense prompts, alongside a bounding-box prompt, for SAM's segmentation decoder. This strategy enables binary segmentation on a per-structure basis, which is subsequently fused to reconstruct the full 3D segmentation volumes. Evaluations on two datasets, the dHCP dataset and an in-house dataset, demonstrate FeTal-SAM's robust performance across gestational ages. Notably, for well-contrasted structures such as the cortical plate and cerebellum, it achieves Dice scores comparable to state-of-the-art baselines trained specifically for each dataset and label definition, while maintaining the flexibility to segment any user-specified anatomy. Although slightly lower accuracy is observed for subtle, low-contrast structures (e.g., hippocampus, amygdala), our results highlight FeTal-SAM's potential to serve as a general-purpose segmentation model without exhaustive retraining. This method thus constitutes a promising step toward clinically adaptable fetal brain MRI analysis tools.
https://arxiv.org/abs/2601.15759
3D occupancy prediction plays a pivotal role in the realm of autonomous driving, as it provides a comprehensive understanding of the driving environment. Most existing methods construct dense scene representations for occupancy prediction, overlooking the inherent sparsity of real-world driving scenes. Recently, 3D superquadric representation has emerged as a promising sparse alternative to dense scene representations due to the strong geometric expressiveness of superquadrics. However, existing superquadric frameworks still suffer from insufficient temporal modeling, a challenging trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. To address these issues, we propose SuperOcc, a novel framework for superquadric-based 3D occupancy prediction. SuperOcc incorporates three key designs: (1) a cohesive temporal modeling mechanism to simultaneously exploit view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy to enhance geometric expressiveness without sacrificing query sparsity; and (3) an efficient superquadric-to-voxel splatting scheme to improve computational efficiency. Extensive experiments on the SurroundOcc and Occ3D benchmarks demonstrate that SuperOcc achieves state-of-the-art performance while maintaining superior efficiency. The code is available at this https URL.
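Superquadric-to-voxel splatting ultimately rests on the superquadric inside-outside function; a minimal occupancy sketch (axis-aligned, no pose, unlike the paper's efficient splatting scheme) is:

```python
import numpy as np

def superquadric_occupancy(grid, scale, eps1, eps2):
    """Voxelise one axis-aligned superquadric by evaluating its
    inside-outside function F(x) <= 1 at each voxel centre.
    `scale` holds the three semi-axes; eps1/eps2 control roundness."""
    x, y, z = (grid / scale).T
    f = (np.abs(x) ** (2 / eps2) + np.abs(y) ** (2 / eps2)) ** (eps2 / eps1) \
        + np.abs(z) ** (2 / eps1)
    return f <= 1.0

pts = np.array([[0.0, 0.0, 0.0],     # centre: inside
                [0.9, 0.0, 0.0],     # near the surface: still inside
                [2.0, 0.0, 0.0]])    # outside
occ = superquadric_occupancy(pts, scale=np.array([1.0, 1.0, 1.0]),
                             eps1=1.0, eps2=1.0)
assert occ.tolist() == [True, True, False]
```

With eps1 = eps2 = 1 the primitive reduces to an ellipsoid; varying the two exponents morphs it toward boxes and cylinders, which is what gives a sparse set of superquadric queries their geometric expressiveness.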
https://arxiv.org/abs/2601.15644
Object-Goal Navigation (ObjectNav) requires an agent to autonomously explore an unknown environment and navigate toward target objects specified by a semantic label. While prior work has primarily studied zero-shot ObjectNav under 2D locomotion, extending it to aerial platforms with 3D locomotion capability remains underexplored. Aerial robots offer superior maneuverability and search efficiency, but they also introduce new challenges in spatial perception, dynamic control, and safety assurance. In this paper, we propose AION for vision-based aerial ObjectNav without relying on external localization or global maps. AION is an end-to-end dual-policy reinforcement learning (RL) framework that decouples exploration and goal-reaching behaviors into two specialized policies. We evaluate AION on the AI2-THOR benchmark and further assess its real-time performance in IsaacSim using high-fidelity drone models. Experimental results show that AION achieves superior performance across comprehensive evaluation metrics in exploration, navigation efficiency, and safety. The video can be found at this https URL.
https://arxiv.org/abs/2601.15614
Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results with single-exposure blurry LDR images and corresponding events.
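The physical event-sensor behavior the event mapping field must bridge is commonly idealised as log-intensity changes crossing a contrast threshold; a toy sketch under that standard assumption (not the paper's learned mapping):

```python
import numpy as np

def events_from_log_intensity(log_prev, log_curr, threshold=0.2):
    """Idealised event-camera model: a pixel fires a +1/-1 event when
    its log-intensity change crosses the contrast threshold, and stays
    silent (0) otherwise."""
    diff = log_curr - log_prev
    polarity = np.zeros_like(diff, dtype=int)
    polarity[diff >= threshold] = 1
    polarity[diff <= -threshold] = -1
    return polarity

prev = np.log(np.array([1.0, 1.0, 1.0]))
curr = np.log(np.array([2.0, 1.0, 0.5]))   # brighter, unchanged, darker
assert events_from_log_intensity(prev, curr).tolist() == [1, 0, -1]
```

Real sensors deviate from this ideal (per-pixel thresholds, noise, refractory periods), which is exactly the sensor-physics mismatch the learned event mapping field is meant to absorb.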
https://arxiv.org/abs/2601.15475
This study presents an integrated framework for enhancing the safety and operational efficiency of robotic arms in laparoscopic surgery by addressing key challenges in collision detection and minimum distance estimation. By combining analytical modeling, real-time simulation, and machine learning, the framework offers a robust solution for ensuring safe robotic operations. An analytical model was developed to estimate the minimum distances between robotic arms based on their joint configurations, offering precise theoretical calculations that serve as both a validation tool and a benchmark. To complement this, a 3D simulation environment was created to model two 7-DOF Kinova robotic arms, generating a diverse dataset of configurations for collision detection and distance estimation. Using these insights, a deep neural network model was trained with joint actuators of robot arms and relative positions as inputs, achieving a mean absolute error of 282.2 mm and an R-squared value of 0.85. The close alignment between predicted and actual distances highlights the network's accuracy and its ability to generalize spatial relationships. This work demonstrates the effectiveness of combining analytical precision with machine learning algorithms to enhance the precision and reliability of robotic systems.
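The analytical minimum-distance model can be approximated by treating each robot link as a line segment; the brute-force sampling below is an illustrative stand-in for a closed-form segment-segment test, not the paper's model:

```python
import numpy as np

def min_link_distance(a0, a1, b0, b1, samples=200):
    """Approximate the minimum distance between two robot links modelled
    as 3D line segments, by densely sampling points along each segment
    and taking the smallest pairwise distance."""
    t = np.linspace(0.0, 1.0, samples)[:, None]
    pa = a0 + t * (a1 - a0)                       # points on link A
    pb = b0 + t * (b1 - b0)                       # points on link B
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return float(d.min())

# Two parallel unit-length links one unit apart: minimum distance is 1.
a0, a1 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
b0, b1 = np.array([0.0, 1.0, 0.0]), np.array([1.0, 1.0, 0.0])
assert np.isclose(min_link_distance(a0, a1, b0, b1), 1.0)
```

In the study's setting, such ground-truth distances (derived from joint configurations via forward kinematics) supply the training targets that the neural network then learns to predict directly.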
https://arxiv.org/abs/2601.15459
Radiance field-based rendering methods have attracted significant interest from the computer vision and computer graphics communities. They enable high-fidelity rendering with complex real-world lighting effects, but at the cost of high rendering time. 3D Gaussian Splatting solves this issue with a rasterisation-based approach for real-time rendering, enabling applications such as autonomous driving, robotics, virtual reality, and extended reality. However, current 3DGS implementations are difficult to integrate into traditional mesh-based rendering pipelines, which is a common use case for interactive applications and artistic exploration. To address this limitation, this software solution uses Nvidia's interprocess communication (IPC) APIs to integrate easily into existing implementations and allow the results to be viewed in external clients such as Unity, Blender, Unreal Engine, and OpenGL viewers. The code is available at this https URL.
https://arxiv.org/abs/2601.15431
Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at this https URL
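The generate-critique-refine loop described above can be sketched generically; the `generate` and `critique` callables below are stand-ins for the T2I model and the vision-language critic, which the paper leaves interchangeable:

```python
def iterative_refine(prompt, generate, critique, max_steps=5):
    """Test-time self-correction loop: regenerate until the critic passes.

    generate(prompt, prev_image, feedback) -> image
    critique(prompt, image) -> (all_constraints_met, feedback)
    """
    image, feedback = None, None
    for _ in range(max_steps):
        image = generate(prompt, image, feedback)   # refine, guided by critique
        ok, feedback = critique(prompt, image)      # VLM-as-critic in the loop
        if ok:
            break
    return image
```

With stub functions that track prompt constraints as a set, the loop satisfies one missing constraint per step, mirroring how the method decomposes a complex prompt into sequential corrections rather than spending the same compute on parallel samples.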
https://arxiv.org/abs/2601.15286
We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see this https URL.
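Once a scene image is factorized into per-light-source contributions, relighting reduces to a per-light linear recombination, roughly I = Σᵢ stateᵢ · intensityᵢ · tintᵢ · Lᵢ. The sketch below (a minimal assumption about how the decomposition is consumed, not the paper's renderer) applies independent on/off, intensity, and chromaticity controls to each decomposed layer:

```python
def relight(light_layers, controls):
    """Recombine per-light contributions under new per-light controls.

    light_layers: one image per decomposed light source, each a flat
                  list of (r, g, b) pixels.
    controls:     per light, a tuple (on, intensity, (tr, tg, tb)).
    """
    out = [(0.0, 0.0, 0.0)] * len(light_layers[0])
    for layer, (on, gain, tint) in zip(light_layers, controls):
        if not on:                      # light switched off: skip its layer
            continue
        out = [(o_r + gain * tint[0] * r,
                o_g + gain * tint[1] * g,
                o_b + gain * tint[2] * b)
               for (o_r, o_g, o_b), (r, g, b) in zip(out, layer)]
    return out
```

In the full method the same per-light controls are baked into a relightable 3D Gaussian splatting representation, so the recombination runs per rendered view in real time rather than on flat images.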
https://arxiv.org/abs/2601.15283
We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and can be adaptive to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet the above desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays but leverages a predicted point along the ray instead of the direction for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes query-frame projective coordinates for computing multi-frequency similarity. Lastly, as the 'predicted' 3D point along a ray may not be precise, RayRoPE presents a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on the tasks of novel-view synthesis and stereo depth estimation and show that it consistently improves over alternate position encoding schemes (e.g. 15% relative improvement on LPIPS in CO3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, resulting in even larger gains over alternatives that cannot positionally encode this information.
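The invariance RayRoPE targets can be seen in a 1-D rotary-encoding analogue: each feature pair is rotated by pos · freq_k, so the dot product between two encoded vectors depends only on their relative offset. The sketch below demonstrates that shift-invariance property (the paper's contribution is extending it to SE(3) via query-frame projective coordinates of predicted ray points, which this toy does not implement):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of vec by pos-dependent angles."""
    out = []
    for k in range(0, len(vec), 2):
        theta = pos / (base ** (k / len(vec)))   # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def sim(a, b):
    """Attention-style multi-frequency similarity: a plain dot product."""
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated by an angle linear in position, `sim(rope(q, i), rope(k, j))` is a function of `j - i` alone; in RayRoPE the analogous statement holds under rigid transformations of the camera rig.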
https://arxiv.org/abs/2601.15275