Real-world data collection for robotics is costly and resource-intensive, requiring skilled operators and expensive hardware. Simulations offer a scalable alternative but often fail to achieve sim-to-real generalization due to geometric and visual gaps. To address these challenges, we propose RE$^3$SIM, a 3D-photorealistic real-to-sim system that targets both the geometric and the visual sim-to-real gap. RE$^3$SIM employs advanced 3D reconstruction and neural rendering techniques to faithfully recreate real-world scenarios, enabling real-time rendering of simulated cross-view cameras within a physics-based simulator. By using privileged information to collect expert demonstrations efficiently in simulation and training robot policies with imitation learning, we validate the effectiveness of the real-to-sim-to-real pipeline across various manipulation task scenarios. Notably, with only simulated data, we achieve zero-shot sim-to-real transfer with an average success rate exceeding 58%. To push the limits of real-to-sim, we further generate a large-scale simulation dataset, demonstrating how a robust policy can be built from simulation data and generalize across various objects. Code and demos are available at: this http URL.
https://arxiv.org/abs/2502.08645
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories, and object class labels--serve as guidance for a text-to-video diffusion model, ensuring that the generated video matches the user's intent. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and delivers prominent 3D-aware text-to-video generation. Project page: this https URL.
https://arxiv.org/abs/2502.08639
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluate various LMMs on PulseCheck457, observing a general decline in performance as task complexity increases, particularly on 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
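The abstract does not spell out how RPDR is computed; a natural reading of a "relative performance dropping rate", assumed here rather than taken from the paper, is the accuracy drop from simpler to harder question types normalized by the simpler-level accuracy:

$$\mathrm{RPDR} = \frac{\bar{A}_{\text{simple}} - \bar{A}_{\text{hard}}}{\bar{A}_{\text{simple}}},$$

where $\bar{A}_{\text{simple}}$ and $\bar{A}_{\text{hard}}$ denote a model's average accuracy on the low-difficulty reference questions and on the harder (e.g., 3D or 6D) questions, respectively.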
https://arxiv.org/abs/2502.08636
Recent advancements in Augmented Reality (AR) have demonstrated applications in architecture, design, and fabrication. Compared to conventional 2D construction drawings, AR can be used to superimpose contextual instructions, display 3D spatial information, and enable on-site engagement. Despite this potential, widespread adoption of the technology in industry is limited by its precision. Precision is important for projects requiring strict construction tolerances, design fidelity, and fabrication feedback. For example, the manufacturing of glulam beams requires tolerances of less than 2 mm. The goal of this project is to explore the industrial application of multiple fiducial markers for high-precision AR fabrication. While the method has been validated in lab settings with a precision of 0.97, this paper focuses on fabricating glulam beams in a factory setting with an industrial manufacturer, Unalam Factory.
https://arxiv.org/abs/2502.08566
The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges, including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions of BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: this https URL.
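The abstract describes LAS only at a high level; a minimal sketch consistent with (iv), averaging several stochastic latent predictions and reading their spread as an uncertainty measure, is given below. The predict_latent function and its arguments are hypothetical placeholders, not the authors' API.

import numpy as np

def latent_average_stabilization(predict_latent, z0, target_age, n_runs=8):
    """Hedged sketch of an LAS-style procedure: repeat a stochastic latent
    prediction, average the runs for a stabilized estimate, and use their
    spread as an uncertainty measure (assumption, not the paper's algorithm)."""
    samples = np.stack([predict_latent(z0, target_age) for _ in range(n_runs)])
    z_stable = samples.mean(axis=0)      # stabilized latent prediction
    uncertainty = samples.std(axis=0)    # per-dimension spread as uncertainty
    return z_stable, uncertainty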
https://arxiv.org/abs/2502.08560
Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs), inspired by the success of generalist models such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models, which capture fine-grained features for multi-modal 2D and 3D understanding; (2) Human-centric AIGC Foundation Models, which generate high-fidelity, diverse human-related content; (3) Unified Perception and Generation Models, which integrate these capabilities to enhance both human understanding and synthesis; and (4) Human-centric Agentic Foundation Models, which extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques and discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiment modeling.
https://arxiv.org/abs/2502.08556
Achieving human-level dexterity in robots is a key objective in the field of robotic manipulation. Recent advancements in 3D-based imitation learning have shown promising results, providing an effective pathway to achieve this goal. However, obtaining high-quality 3D representations presents two key problems: (1) the quality of point clouds captured by a single-view camera is significantly affected by factors such as camera resolution, positioning, and occlusions caused by the dexterous hand; (2) global point clouds lack crucial contact information and spatial correspondences, which are necessary for fine-grained dexterous manipulation tasks. To overcome these limitations, we propose CordViP, a novel framework that constructs and learns correspondences by leveraging robust 6D pose estimation of objects and robot proprioception. Specifically, we first introduce interaction-aware point clouds, which establish correspondences between the object and the hand. These point clouds are then used for our pre-training policy, where we also incorporate object-centric contact maps and hand-arm coordination information, effectively capturing both spatial and temporal dynamics. Our method demonstrates exceptional dexterous manipulation capabilities, with an average success rate of 90% in four real-world tasks, surpassing other baselines by a large margin. Experimental results also highlight the superior generalization and robustness of CordViP to different objects, viewpoints, and scenarios. Code and videos are available on this https URL.
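A minimal sketch of how an interaction-aware point cloud and an object-centric contact map could be assembled from the ingredients named above (an estimated object 6D pose plus hand points obtained from proprioception); all names and the contact threshold are illustrative assumptions, not the authors' implementation.

import numpy as np

def interaction_aware_cloud(obj_model_pts, obj_pose, hand_pts, contact_thresh=0.01):
    """Illustrative sketch: place the object model with its estimated 6D pose,
    concatenate it with hand points from proprioception, and mark contacts."""
    R, t = obj_pose[:3, :3], obj_pose[:3, 3]
    obj_pts = obj_model_pts @ R.T + t                    # object points in the world frame
    # object-centric contact map: distance from each object point to the nearest hand point
    d = np.linalg.norm(obj_pts[:, None, :] - hand_pts[None, :, :], axis=-1).min(axis=1)
    contact_map = (d < contact_thresh).astype(np.float32)
    cloud = np.concatenate([obj_pts, hand_pts], axis=0)  # combined interaction-aware cloud
    return cloud, contact_map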
https://arxiv.org/abs/2502.08449
Recently, the generation of dynamic 3D objects from a video has shown impressive results. Existing methods directly optimize Gaussians using all the information in frames. However, when dynamic regions are interwoven with static regions within frames, particularly if the static regions account for a large proportion, existing methods often overlook information in dynamic regions and are prone to overfitting on static regions. This leads to results with blurry textures. We argue that decoupling dynamic and static features to enhance dynamic representations can alleviate this issue. Thus, we propose a dynamic-static feature decoupling module (DSFD). Along the temporal axis, it regards the portions of current-frame features that differ significantly from reference-frame features as dynamic features; the remaining parts are treated as static features. We then obtain decoupled features driven by the dynamic features and the current-frame features. Moreover, to further enhance the dynamic representation of decoupled features from different viewpoints and ensure accurate motion prediction, we design a temporal-spatial similarity fusion module (TSSF). Along the spatial axis, it adaptively selects similar information from dynamic regions. Building on the above, we construct a novel approach, DS4D. Experimental results verify that our method achieves state-of-the-art (SOTA) results in video-to-4D generation. In addition, experiments on a real-world scenario dataset demonstrate its effectiveness on 4D scenes. Our code will be publicly available.
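A minimal sketch of the decoupling rule as stated above, keeping feature locations whose difference from the reference frame is significant as dynamic and the rest as static; the threshold and feature shapes are assumptions for illustration, not the paper's exact criterion.

import numpy as np

def decouple_dynamic_static(curr_feat, ref_feat, tau=0.5):
    """Illustrative DSFD-style split of current-frame features (H, W, C) into
    dynamic and static parts by their difference from reference-frame features."""
    diff = np.linalg.norm(curr_feat - ref_feat, axis=-1)            # (H, W) difference magnitude
    dynamic_mask = (diff > tau).astype(curr_feat.dtype)[..., None]  # tau is an arbitrary choice here
    dynamic_feat = dynamic_mask * curr_feat                         # kept where the change is significant
    static_feat = (1.0 - dynamic_mask) * curr_feat                  # the remainder
    return dynamic_feat, static_feat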
https://arxiv.org/abs/2502.08377
Accurate segmentation of all pathological findings in 3D medical images remains a significant challenge, as supervised models are limited to detecting only the few pathology classes annotated in existing datasets. To address this, we frame pathology segmentation as an unsupervised visual anomaly segmentation (UVAS) problem, leveraging the inherent rarity of pathological patterns compared to healthy ones. We enhance the existing density-based UVAS framework with two key innovations: (1) dense self-supervised learning (SSL) for feature extraction, eliminating the need for supervised pre-training, and (2) learned, masking-invariant dense features as conditioning variables, replacing hand-crafted positional encodings. Trained on over 30,000 unlabeled 3D CT volumes, our model, Screener, outperforms existing UVAS methods on four large-scale test datasets comprising 1,820 scans with diverse pathologies. Code and pre-trained models will be made publicly available.
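Density-based UVAS, as used above, flags a voxel as anomalous when its feature is unlikely under a density model fitted to normal (healthy) data; the sketch below uses a generic k-nearest-neighbor surrogate for that idea and is not the authors' specific estimator or conditioning scheme.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def anomaly_scores(normal_feats, test_feats, k=5):
    """Generic density-based anomaly scoring: features far from their k nearest
    'normal' training features receive high anomaly scores."""
    nn = NearestNeighbors(n_neighbors=k).fit(normal_feats)
    dists, _ = nn.kneighbors(test_feats)   # (N_test, k) distances to normal features
    return dists.mean(axis=1)              # larger mean distance = more anomalous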
https://arxiv.org/abs/2502.08321
Volumetric video enables immersive experiences by capturing dynamic 3D scenes, supporting diverse applications in virtual reality, education, and telepresence. However, traditional methods struggle with fixed lighting conditions, while neural approaches face trade-offs in efficiency, quality, or adaptability for relightable scenarios. To address these limitations, we present BEAM, a novel pipeline that bridges 4D Gaussian representations with physically-based rendering (PBR) to produce high-quality, relightable volumetric videos from multi-view RGB footage. BEAM recovers detailed geometry and PBR properties via a series of Gaussian-based techniques. It first combines Gaussian-based performance tracking with geometry-aware rasterization in a coarse-to-fine optimization framework to recover spatially and temporally consistent geometries. We further enhance the Gaussian attributes by incorporating PBR properties step by step. We generate roughness via a multi-view-conditioned diffusion model, and then derive ambient occlusion (AO) and base color using a 2D-to-3D strategy, incorporating a tailored Gaussian-based ray tracer for efficient visibility computation. Once recovered, these dynamic, relightable assets integrate seamlessly into traditional CG pipelines, supporting real-time rendering with deferred shading and offline rendering with ray tracing. By offering realistic, lifelike visualizations under diverse lighting conditions, BEAM opens new possibilities for interactive entertainment, storytelling, and creative visualization.
https://arxiv.org/abs/2502.08297
Differentiating signals from the background in micrographs is a critical initial step for cryogenic electron microscopy (cryo-EM), yet it remains laborious due to low signal-to-noise ratio (SNR), the presence of contaminants and densely packed particles of varying sizes. Although image segmentation has recently been introduced to distinguish particles at the pixel level, the low SNR complicates the automated generation of accurate annotations for training supervised models. Moreover, platforms for systematically comparing different design choices in pipeline construction are lacking. Thus, a modular framework is essential to understand the advantages and limitations of this approach and drive further development. To address these challenges, we present a pipeline that automatically generates high-quality segmentation maps from cryo-EM data to serve as ground truth labels. Our modular framework enables the selection of various segmentation models and loss functions. We also integrate Conditional Random Fields (CRFs) with different solvers and feature sets to refine coarse predictions, thereby producing fine-grained segmentation. This flexibility facilitates optimal configurations tailored to cryo-EM datasets. When trained on a limited set of micrographs, our approach achieves over 90% accuracy, recall, precision, Intersection over Union (IoU), and F1-score on synthetic data. Furthermore, to demonstrate our framework's efficacy in downstream analyses, we show that the particles extracted by our pipeline produce 3D density maps with higher resolution than those generated by existing particle pickers on real experimental datasets, while achieving performance comparable to that of manually curated datasets from experts.
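For reference, the reported segmentation metrics have standard pixel-wise definitions; a compact implementation over binary masks is shown below (standard formulas, not code from the paper).

import numpy as np

def segmentation_metrics(pred, gt):
    """Standard metrics for binary segmentation masks given as 0/1 arrays."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt); fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt); tn = np.sum(~pred & ~gt)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return dict(accuracy=accuracy, recall=recall, precision=precision, iou=iou, f1=f1)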
https://arxiv.org/abs/2502.08287
Point cloud registration approaches often fail when the overlap between point clouds is low, due to noisy point correspondences. This work introduces a novel cross-attention mechanism tailored for Transformer-based architectures that tackles this problem by fusing information from coordinates and features at the super-point level between point clouds. This formulation has remained unexplored primarily because it must guarantee rotation and translation invariance, since point clouds reside in different and independent reference frames. We integrate the Gromov-Wasserstein distance into the cross-attention formulation to jointly compute distances between points across different point clouds and account for their geometric structure. By doing so, points from two distinct point clouds can attend to each other under arbitrary rigid transformations. At the point level, we also devise a self-attention mechanism that aggregates local geometric structure information into point features for fine matching. Our formulation boosts the number of inlier correspondences, thereby yielding more precise registration results than state-of-the-art approaches. We have conducted an extensive evaluation on the 3DMatch, 3DLoMatch, KITTI, and 3DCSR datasets.
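The reason the Gromov-Wasserstein distance suits this setting is that it compares only intra-cloud pairwise distances, which are unchanged by rotating or translating either cloud. The sketch below evaluates the GW objective for a fixed coupling matrix to illustrate that invariance; it is not the paper's attention implementation.

import numpy as np

def gw_objective(X, Y, T):
    """Gromov-Wasserstein objective sum_{i,j,k,l} (Dx[i,k] - Dy[j,l])^2 T[i,j] T[k,l]
    for point sets X (n, 3) and Y (m, 3) with coupling T (n, m). It depends only on
    the intra-cloud distance matrices Dx and Dy, hence on no shared reference frame."""
    Dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # (n, n)
    Dy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)   # (m, m)
    p, q = T.sum(axis=1), T.sum(axis=0)                           # coupling marginals
    return float(p @ (Dx**2) @ p + q @ (Dy**2) @ q - 2.0 * np.sum(Dx * (T @ Dy @ T.T)))

Applying an arbitrary rigid transformation to X or Y before calling gw_objective leaves the value unchanged, which is what allows super-points expressed in independent reference frames to attend to each other.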
https://arxiv.org/abs/2502.08285
This paper presents FloVD, a novel optical-flow-based video diffusion model for camera-controllable video generation. FloVD leverages optical flow maps to represent the motions of the camera and of moving objects. This approach offers two key benefits. First, since optical flow can be directly estimated from videos, our approach allows the use of arbitrary training videos without ground-truth camera parameters. Second, because background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
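One way to see why background flow encodes camera motion: given per-pixel depth and camera intrinsics, the flow induced purely by a camera transform follows from re-projection, as sketched below (a generic pinhole-camera derivation, not the paper's implementation).

import numpy as np

def camera_induced_flow(depth, K, R, t):
    """Optical flow of static background pixels caused by a camera motion (R, t),
    assuming a pinhole camera with intrinsics K and per-pixel depth in view 1."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                                      # back-projected rays
    pts = rays * depth[..., None]                                        # 3D points in camera 1
    pts2 = pts @ R.T + t                                                 # expressed in camera 2
    proj = pts2 @ K.T
    uv2 = proj[..., :2] / proj[..., 2:3]                                 # re-projected pixels
    return uv2 - pix[..., :2]                                            # flow field (H, W, 2)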
https://arxiv.org/abs/2502.08244
Deep learning can effectively learn high-level semantic features for PolSAR images in Euclidean space, but such methods need to convert the complex covariance matrix into a feature vector or complex-valued vector as the network input. However, complex covariance matrices are essentially complex Hermitian positive definite (HPD) matrices that live on a Riemannian manifold rather than in Euclidean space. The real and imaginary parts of the matrix are equally significant, as the imaginary part carries the phase information. Vectorizing the matrix destroys the geometric structure and manifold characteristics of complex covariance matrices. To learn complex HPD matrices directly, we propose a Riemannian complex HPD convolution network (HPD_CNN) for PolSAR images. This method consists of a complex HPD unfolding network (HPDnet) and a CV-3DCNN enhanced network. The proposed complex HPDnet defines HPD mapping, rectifying, and logEig layers to learn geometric features of complex matrices. In addition, a fast eigenvalue decomposition method is designed to reduce the computational burden. Finally, a Riemannian-to-Euclidean enhanced network is defined to enhance contextual information for classification. Experimental results on two real PolSAR datasets demonstrate that the proposed method achieves superior performance to state-of-the-art methods, especially in heterogeneous regions.
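The logEig layer named above maps an HPD matrix to a log-domain (Euclidean) representation through its eigendecomposition; a minimal version of that operation for a single complex HPD matrix is given below (the standard matrix-logarithm construction, not the authors' fast decomposition).

import numpy as np

def log_eig(C, eps=1e-8):
    """logEig of a complex Hermitian positive definite matrix C: eigendecompose
    and take the logarithm of the (real, positive) eigenvalues."""
    w, U = np.linalg.eigh(C)        # eigh handles Hermitian matrices
    w = np.maximum(w, eps)          # guard against numerical non-positivity
    return U @ np.diag(np.log(w)) @ U.conj().T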
https://arxiv.org/abs/2502.08137
Vision-based target motion estimation is a fundamental problem in many robotic tasks. Existing methods suffer from low observability and hence face challenges in tracking highly maneuverable targets. Motivated by the aerial target pursuit task, where a target may maneuver in 3D space, this paper studies how to further enhance observability by incorporating the bearing rate information that has not been well explored in the literature. The main contribution of this paper is a new cooperative estimator called STT-R (Spatial-Temporal Triangulation with bearing Rate), designed under the framework of distributed recursive least squares. The theoretical results are further verified by numerical simulations and real-world experiments. It is shown that the proposed STT-R algorithm generates more accurate estimates and effectively reduces the lag in velocity estimation, enabling the tracking of more maneuverable targets.
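For context, the distributed recursive least squares framework mentioned above builds on the standard RLS update of an estimate and its covariance from a new linear measurement; the sketch below is the textbook single-step update, not the STT-R estimator itself.

import numpy as np

def rls_update(x_hat, P, H, y, R):
    """One textbook recursive-least-squares step for a measurement y = H x + noise
    with noise covariance R; returns the updated estimate and covariance."""
    S = H @ P @ H.T + R                        # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # gain
    x_new = x_hat + K @ (y - H @ x_hat)        # corrected estimate
    P_new = (np.eye(len(x_hat)) - K @ H) @ P   # updated covariance
    return x_new, P_new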
https://arxiv.org/abs/2502.08089
We present Pippo, a generative model capable of producing 1K-resolution dense turnaround videos of a person from a single casually captured photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs, e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio-captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low resolution and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high resolution and use pixel-aligned controls (e.g., spatial anchors and Plucker rays) to enable 3D-consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate more than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate the 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
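Plucker rays, listed among the pixel-aligned controls, encode each pixel's viewing ray as the pair (direction, origin x direction), which does not depend on the choice of point along the ray; a minimal construction from intrinsics and a world-to-camera pose might look like the following (a generic formulation, not Pippo's conditioning code).

import numpy as np

def plucker_ray_map(K, R, t, H, W):
    """Per-pixel Plucker coordinates (d, o x d) for a camera with intrinsics K
    and world-to-camera rotation R and translation t."""
    cam_origin = -R.T @ t                                       # camera center in the world frame
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)
    dirs = pix @ np.linalg.inv(K).T @ R                         # rays rotated into the world frame
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit directions d
    moment = np.cross(np.broadcast_to(cam_origin, dirs.shape), dirs)  # o x d
    return np.concatenate([dirs, moment], axis=-1)              # (H, W, 6)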
https://arxiv.org/abs/2502.07785
We present MatSwap, a method to transfer materials to designated surfaces in an image photorealistically. Such a task is non-trivial due to the large entanglement of material appearance, geometry, and lighting in a photograph. In the literature, material editing methods typically rely on either cumbersome text engineering or extensive manual annotations requiring artist knowledge and 3D scene properties that are impractical to obtain. In contrast, we propose to directly learn the relationship between the input material -- as observed on a flat surface -- and its appearance within the scene, without the need for explicit UV mapping. To achieve this, we rely on a custom light- and geometry-aware diffusion model. We fine-tune a large-scale pre-trained text-to-image model for material transfer using our synthetic dataset, preserving its strong priors to ensure effective generalization to real images. As a result, our method seamlessly integrates a desired material into the target location in the photograph while retaining the identity of the scene. We evaluate our method on synthetic and real images and show that it compares favorably to recent work both qualitatively and quantitatively. We will release our code and data upon publication.
https://arxiv.org/abs/2502.07784
Monocular egocentric 3D human motion capture remains a significant challenge, particularly under conditions of low lighting and fast movements, which are common in head-mounted device applications. Existing methods that rely on RGB cameras often fail under these conditions. To address these limitations, we introduce EventEgo3D++, the first approach that leverages a monocular event camera with a fisheye lens for 3D human motion capture. Event cameras excel in high-speed scenarios and varying illumination due to their high temporal resolution, providing reliable cues for accurate 3D human motion capture. EventEgo3D++ leverages the LNES representation of event streams to enable precise 3D reconstructions. We have also developed a mobile head-mounted device (HMD) prototype equipped with an event camera, capturing a comprehensive dataset that includes real event observations from both controlled studio environments and in-the-wild settings, in addition to a synthetic dataset. Additionally, to provide a more holistic dataset, we include allocentric RGB streams that offer different perspectives of the HMD wearer, along with their corresponding SMPL body model. Our experiments demonstrate that EventEgo3D++ achieves superior 3D accuracy and robustness compared to existing solutions, even in challenging conditions. Moreover, our method supports real-time 3D pose updates at a rate of 140Hz. This work is an extension of the EventEgo3D approach (CVPR 2024) and further advances the state of the art in egocentric 3D human motion capture. For more details, visit the project page at this https URL.
https://arxiv.org/abs/2502.07869
Gaussian Splatting (GS) is a recent and pivotal technique in 3D computer graphics. GS-based algorithms almost always bypass classical methods such as ray tracing, which offers numerous inherent advantages for rendering. For example, ray tracing is able to handle incoherent rays for advanced lighting effects, including shadows and reflections. To address this limitation, we introduce MeshSplats, a method which converts GS to a mesh-like format. Following the completion of training, MeshSplats transforms Gaussian elements into mesh faces, enabling rendering using ray tracing methods with all their associated benefits. Our model can be utilized immediately following transformation, yielding a mesh of slightly reduced quality without additional training. Furthermore, we can enhance the reconstruction quality through the application of a dedicated optimization algorithm that operates on mesh faces rather than Gaussian components. The efficacy of our method is substantiated by experimental results, underscoring its extensive applications in computer graphics and image processing.
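The conversion idea can be illustrated by flattening each Gaussian into a small planar face spanned by the principal axes of its covariance; the simplified sketch below uses a 3-sigma quad built from two triangles, which are illustrative choices rather than the paper's exact construction.

import numpy as np

def gaussian_to_quad(mean, cov, k=3.0):
    """Turn one 3D Gaussian into a quad (4 vertices, 2 triangles) lying in the
    plane of its two largest principal axes, scaled to roughly k standard deviations."""
    w, V = np.linalg.eigh(cov)               # eigenvalues in ascending order
    a = k * np.sqrt(w[2]) * V[:, 2]          # major axis
    b = k * np.sqrt(w[1]) * V[:, 1]          # second axis
    verts = np.stack([mean + a + b, mean - a + b, mean - a - b, mean + a - b])
    faces = np.array([[0, 1, 2], [0, 2, 3]]) # two triangles forming the quad
    return verts, faces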
https://arxiv.org/abs/2502.07754
We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, using just the same model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: this https URL.
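One common way to realize such a mask learning strategy is to replace the tokens of missing modalities with a learned mask embedding, so that incomplete samples (e.g., image-pose or image-depth pairs) still yield a full-modality training input; the sketch below illustrates that generic idea only, with shapes and names assumed rather than taken from Matrix3D.

import numpy as np

def apply_modality_mask(tokens, present, mask_embedding):
    """tokens: dict modality -> (N_tokens, D) array; present: dict modality -> bool.
    Tokens of absent modalities are replaced by a learned mask embedding."""
    out = {}
    for name, tok in tokens.items():
        if present[name]:
            out[name] = tok
        else:
            out[name] = np.broadcast_to(mask_embedding, tok.shape).copy()
    return out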
https://arxiv.org/abs/2502.07685