We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.
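The abstract gives no code, but the per-pixel deformable Gaussian representation can be pictured with a small data-layout sketch. Everything below (array shapes, field names such as "flow") is our own illustrative assumption, not the paper's implementation: each pixel contributes one Gaussian plus per-timestep 3D offsets, so dynamic view synthesis deforms the centers and long-range 3D tracking simply reads the offset sequence.

    import numpy as np

    H, W, T = 4, 4, 5                                   # toy frame size and timestep count

    splats = {                                          # one Gaussian per pixel (hypothetical layout)
        "mean":     np.zeros((H * W, 3)),               # canonical 3D centers (e.g., unprojected depth)
        "rotation": np.tile([1.0, 0.0, 0.0, 0.0], (H * W, 1)),  # unit quaternions
        "scale":    np.full((H * W, 3), 0.01),
        "opacity":  np.full((H * W, 1), 0.9),
        "color":    np.random.rand(H * W, 3),
        "flow":     np.zeros((H * W, T, 3)),            # per-timestep 3D scene-flow offsets
    }

    def deform(splats, t):
        """Gaussian centers at timestep t: canonical centers plus the predicted offsets."""
        return splats["mean"] + splats["flow"][:, t]

    # Reading the offsets across all timesteps yields long-range 3D tracks per pixel.
    tracks = np.stack([deform(splats, t) for t in range(T)], axis=1)   # (H*W, T, 3)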
https://arxiv.org/abs/2506.09997
If human experience is any guide, operating effectively in unstructured environments -- like homes and offices -- requires robots to sense the forces during physical interaction. Yet, the lack of a versatile, accessible, and easily customizable tactile sensor has led to fragmented, sensor-specific solutions in robotic manipulation -- and in many cases, to force-unaware, sensorless approaches. With eFlesh, we bridge this gap by introducing a magnetic tactile sensor that is low-cost, easy to fabricate, and highly customizable. Building an eFlesh sensor requires only four components: a hobbyist 3D printer, off-the-shelf magnets (<$5), a CAD model of the desired shape, and a magnetometer circuit board. The sensor is constructed from tiled, parameterized microstructures, which allow for tuning the sensor's geometry and its mechanical response. We provide an open-source design tool that converts convex OBJ/STL files into 3D-printable STLs for fabrication. This modular design framework enables users to create application-specific sensors, and to adjust sensitivity depending on the task. Our sensor characterization experiments demonstrate the capabilities of eFlesh: contact localization RMSE of 0.5 mm, and force prediction RMSE of 0.27 N for normal force and 0.12 N for shear force. We also present a learned slip detection model that generalizes to unseen objects with 95% accuracy, and visuotactile control policies that improve manipulation performance by 40% over vision-only baselines -- achieving 91% average success rate for four precise tasks that require sub-mm accuracy for successful completion. All design files, code and the CAD-to-eFlesh STL conversion tool are open-sourced and available on this https URL.
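Neither the sensor layout nor the learned models are specified in the abstract; as a loose illustration only, the sketch below assumes a handful of magnetometer channels and regresses normal and shear force with a small off-the-shelf MLP. The channel count, architecture, and data are placeholders, not the eFlesh pipeline.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 12))          # 12 magnetometer channels (synthetic stand-in data)
    y = rng.normal(size=(2000, 3))           # [normal force, shear x, shear y] (synthetic labels)

    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300)
    model.fit(X, y)                          # in practice: calibrated force-probe data

    reading = rng.normal(size=(1, 12))
    normal, shear_x, shear_y = model.predict(reading)[0]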
https://arxiv.org/abs/2506.09994
We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: this https URL
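The rectified-flow objective itself is standard, so a compact training-step sketch is easy to give; the stand-in MLP, feature sizes, and conditioning scheme below are assumptions, not the paper's architecture. The model learns a velocity field along the straight line between noise and the target audio, conditioned on the hand trajectory.

    import torch
    import torch.nn as nn

    audio_dim, traj_dim = 256, 63            # assumed feature sizes, for illustration only
    net = nn.Sequential(nn.Linear(audio_dim + traj_dim + 1, 512), nn.SiLU(),
                        nn.Linear(512, audio_dim))

    def rectified_flow_loss(audio, traj):
        noise = torch.randn_like(audio)                      # x_0 ~ N(0, I)
        t = torch.rand(audio.shape[0], 1)                    # random time in [0, 1]
        x_t = (1 - t) * noise + t * audio                    # straight-line interpolation
        target_v = audio - noise                             # constant velocity along that line
        pred_v = net(torch.cat([x_t, traj, t], dim=-1))      # conditioned on the hand trajectory
        return ((pred_v - target_v) ** 2).mean()

    loss = rectified_flow_loss(torch.randn(8, audio_dim), torch.randn(8, traj_dim))
    loss.backward()

At test time, sampling integrates the learned velocity field from noise toward audio for a user-supplied hand-pose sequence.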
https://arxiv.org/abs/2506.09989
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be openly released.
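How DyMeshVAE disentangles its factors is not spelled out in the abstract; one simple way to picture "spatial vs. temporal" is the split below, where the first frame carries geometry and the remaining motion is stored as per-frame displacements. This is our reading for illustration, not the actual architecture.

    import numpy as np

    T, V = 16, 1000                                  # frames, vertices (toy sizes)
    verts = np.random.rand(T, V, 3)                  # an animated mesh sequence (fixed topology)

    spatial = verts[0]                               # rest geometry: the "spatial" factor
    temporal = verts - spatial[None]                 # per-frame displacements: the "temporal" factor

    # A latent generative model (e.g., rectified flow conditioned on text) would operate on
    # compressed encodings of these two factors; reconstruction is just their recombination.
    recon = spatial[None] + temporal
    assert np.allclose(recon, verts)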
https://arxiv.org/abs/2506.09982
Recent progress in 3D object generation has greatly improved both the quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.
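The abstract does not say how parts are assigned to the two volumes; one natural reading, sketched below purely as an assumption, is a bipartition of the part-contact graph so that touching parts never end up fused inside the same volume.

    # Hypothetical part contacts for a chair-like object (illustrative only).
    adjacency = {
        "seat": ["leg1", "leg2", "back"],
        "leg1": ["seat"], "leg2": ["seat"],
        "back": ["seat", "headrest"], "headrest": ["back"],
    }

    volume = {}                                      # part -> volume 0 or 1
    for part in adjacency:                           # greedy 2-coloring of the contact graph
        taken = {volume[n] for n in adjacency[part] if n in volume}
        volume[part] = 0 if 0 not in taken else 1

    print(volume)   # e.g. {'seat': 0, 'leg1': 1, 'leg2': 1, 'back': 1, 'headrest': 0}

Greedy coloring only succeeds when the contact graph is bipartite; the paper's packing strategy presumably handles the general case.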
https://arxiv.org/abs/2506.09980
The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at this https URL.
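The supervision path is the interesting part, so the sketch below keeps only that: a backbone maps points to Gaussian parameters, a differentiable renderer turns them into pixels, and an image loss drives end-to-end optimization. The backbone, the one-layer "renderer", and all shapes are stand-ins; the real method uses a point-cloud backbone, differentiable Gaussian splatting, and fused 2D image features.

    import torch
    import torch.nn as nn

    backbone = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 14))  # per-point Gaussian params (toy)
    splat = nn.Linear(14, 3)          # placeholder for a differentiable Gaussian splatting renderer

    points = torch.randn(1, 2048, 3)              # an object- or scene-level point cloud
    gt_pixels = torch.rand(1, 2048, 3)            # ground-truth colors the rendering should match

    gaussians = backbone(points)                  # pre-training task: predict Gaussian primitives
    rendered = splat(gaussians)                   # differentiable rendering (stand-in)
    loss = ((rendered - gt_pixels) ** 2).mean()   # precise pixel-level supervision
    loss.backward()                               # gradients reach the backbone end to end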
https://arxiv.org/abs/2506.09952
Developing 3D-VL generalists capable of understanding 3D scenes and following natural language instructions to perform a wide range of tasks has been a long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL models still lag behind their 2D counterparts in capability and robustness, falling short of the generalist standard. A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation. We propose LEO-VL, a 3D-VL model built upon condensed feature grid (CFG), an efficient scene representation that bridges 2D perception and 3D spatial structure while significantly reducing token overhead. This efficiency unlocks large-scale training towards a 3D-VL generalist, for which we curate over 700k high-quality 3D-VL samples spanning four domains of real-world indoor scenes and five tasks such as captioning and dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the efficiency of our representation, the importance of task and scene diversity, and the validity of our data curation principle. Furthermore, we introduce SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL models. We hope our findings contribute to the advancement of scalable and robust 3D-VL generalists.
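The abstract does not define the condensed feature grid construction; the sketch below shows one plausible reading, offered only as an assumption: features lifted from 2D perception are pooled into a coarse voxel grid so the language model consumes one token per occupied cell rather than one per point.

    import numpy as np

    pts = np.random.rand(20_000, 3) * 8.0            # scene points in meters (toy)
    feats = np.random.rand(20_000, 512)              # per-point features lifted from 2D perception
    cell = 0.5                                       # grid resolution in meters

    grid = {}
    for key, f in zip(map(tuple, np.floor(pts / cell).astype(int)), feats):
        grid.setdefault(key, []).append(f)           # bucket features by voxel

    tokens = np.stack([np.mean(v, axis=0) for v in grid.values()])   # one token per occupied cell
    print(len(feats), "->", len(tokens), "tokens")                   # large reduction in token count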
https://arxiv.org/abs/2506.09935
Learning self-supervised representations that are invariant and equivariant to transformations is crucial for advancing beyond traditional visual classification tasks. However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. Instead, we leverage the intrinsic pose-awareness capabilities of capsules to improve performance in pose estimation tasks. To further challenge our assumptions, we increase task complexity via multi-geometric transformations to enable a more thorough evaluation of invariance and equivariance by introducing 3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical results demonstrate that EquiCaps outperforms prior state-of-the-art equivariant methods on rotation prediction, achieving a supervised-level $R^2$ of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE and CapsIE by 0.05 and 0.04 $R^2$, respectively. Moreover, in contrast to non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant performance under combined geometric transformations, underscoring its generalisation capabilities and the promise of predictor-free capsule architectures.
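For readers unfamiliar with the quoted metric, the sketch below shows how an R^2 score over rotation predictions can be computed; we assume rotations are compared as flattened 3x3 matrices, though the benchmark's exact parameterization may differ.

    import numpy as np
    from sklearn.metrics import r2_score
    from scipy.spatial.transform import Rotation

    gt = Rotation.random(500, random_state=0)                    # ground-truth object rotations
    noise = np.random.default_rng(0).normal(0.0, 0.05, (500, 3))
    pred = Rotation.from_rotvec(gt.as_rotvec() + noise)          # toy "predictions"

    r2 = r2_score(gt.as_matrix().reshape(500, -1),               # compare as flattened matrices
                  pred.as_matrix().reshape(500, -1))
    print(f"rotation-prediction R^2 = {r2:.2f}")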
https://arxiv.org/abs/2506.09895
We consider the problem of generalizable novel view synthesis (NVS), which aims to generate photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. This task remains fundamentally challenging, as it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, including architectural 3D inductive biases (e.g., embedding explicit 3D representations, such as NeRF or 3DGS, into network design) and ground-truth camera poses for both input and target views. While recent efforts have sought to reduce the 3D inductive bias or the dependence on known camera poses of input views, critical questions regarding the role of 3D knowledge and the necessity of circumventing its use remain under-explored. In this work, we conduct a systematic analysis of 3D knowledge and uncover a critical trend: the performance of methods that require less 3D knowledge improves faster as data scales, eventually reaching parity with their 3D knowledge-driven counterparts, which highlights the increasing importance of reducing dependence on 3D knowledge in the era of large-scale data. Motivated by and following this trend, we propose a novel NVS framework that minimizes 3D inductive bias and pose dependence for both input and target views. By eliminating this 3D knowledge, our method fully leverages data scaling and learns implicit 3D awareness directly from sparse 2D images, without any 3D inductive bias or pose annotation during training. Extensive experiments demonstrate that our model generates photorealistic and 3D-consistent novel views, achieving performance comparable to methods that rely on posed inputs, thereby validating the feasibility and effectiveness of our data-centric paradigm. Project page: this https URL.
https://arxiv.org/abs/2506.09885
Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.
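The abstract names three distilled cues but not their loss forms; the sketch below is one assumed combination (a cosine loss on sparse correspondences, a toy ordinal loss on relative depth, and matching of cross-view cost volumes), with weights and heads chosen only for illustration.

    import torch
    import torch.nn.functional as F

    feat_a = torch.randn(1024, 256, requires_grad=True)   # VLM patch features, view A
    feat_b = torch.randn(1024, 256, requires_grad=True)   # VLM patch features, view B

    corr = torch.randint(0, 1024, (128, 2))               # teacher sparse correspondences (i in A, j in B)
    pairs = torch.randint(0, 1024, (128, 2))              # patch pairs with known depth ordering
    closer = torch.randint(0, 2, (128,)).float()          # teacher: is patch i closer than patch j?
    teacher_cost = torch.rand(1024, 1024)                 # teacher dense cost volume (A x B)

    l_corr = 1 - F.cosine_similarity(feat_a[corr[:, 0]], feat_b[corr[:, 1]]).mean()
    depth_logit = (feat_a[pairs[:, 0]] - feat_a[pairs[:, 1]]).mean(dim=-1)   # toy depth-order head
    l_depth = F.binary_cross_entropy_with_logits(depth_logit, closer)
    l_cost = F.mse_loss((feat_a @ feat_b.t()).softmax(-1), teacher_cost.softmax(-1))

    loss = l_corr + 0.5 * l_depth + 0.5 * l_cost           # illustrative weights
    loss.backward()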
https://arxiv.org/abs/2506.09883
While text-to-3D generation has attracted growing interest, existing methods often struggle to produce 3D assets that align well with human preferences. Current preference alignment techniques for 3D content typically rely on hard-to-collect preference-paired multi-view 2D images to train 2D reward models, which then guide 3D generation -- leading to geometric artifacts due to their inherent 2D bias. To address these limitations, we construct 3D-MeshPref, the first large-scale unpaired 3D preference dataset, featuring diverse 3D meshes annotated by a large language model and refined by human evaluators. We then develop RewardCS, the first reward model trained directly on unpaired 3D-MeshPref data using a novel Cauchy-Schwarz divergence objective, enabling effective learning of human-aligned 3D geometric preferences without requiring paired comparisons. Building on this, we propose DreamCS, a unified framework that integrates RewardCS into text-to-3D pipelines -- enhancing both implicit and explicit 3D generation with human preference feedback. Extensive experiments show DreamCS outperforms prior methods, producing 3D assets that are both geometrically faithful and human-preferred. Code and models will be released publicly.
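The abstract does not state the objective in closed form, but the Cauchy-Schwarz divergence it builds on is standard. For densities p and q,

    D_{CS}(p, q) = -\log \frac{\int p(x)\, q(x)\, dx}{\sqrt{\int p(x)^{2}\, dx \,\int q(x)^{2}\, dx}}

It is non-negative, vanishes exactly when p = q, and can be estimated with kernel density estimates from two independent, unpaired sample sets; that property is presumably what makes it suitable for the unpaired 3D-MeshPref data, although how RewardCS instantiates it is not described in the abstract.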
https://arxiv.org/abs/2506.09814
Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? Our answer is no. We introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.
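The abstract does not detail the feature-aware scoring; purely as a guess at the flavor of such a criterion, the sketch below ranks a candidate pose by combining its geometric inlier ratio with the similarity of the features it brings into correspondence.

    import numpy as np

    def score_pose(R, t, src_pts, tgt_pts, src_feat, tgt_feat, tau=0.01):
        moved = src_pts @ R.T + t                                    # apply the candidate pose
        d = np.linalg.norm(moved[:, None] - tgt_pts[None], axis=-1)  # pairwise point distances
        nn = d.argmin(axis=1)                                        # nearest target point per source point
        inlier = d[np.arange(len(src_pts)), nn] < tau
        if not inlier.any():
            return 0.0
        sim = (src_feat[inlier] * tgt_feat[nn[inlier]]).sum(-1)      # cosine similarity (unit features)
        return inlier.mean() * sim.mean()                            # geometric fit times feature agreement

    # Toy usage: identical clouds under the identity pose score 1.0.
    pts = np.random.rand(200, 3)
    f = np.random.rand(200, 32); f /= np.linalg.norm(f, axis=1, keepdims=True)
    print(score_pose(np.eye(3), np.zeros(3), pts, pts, f, f))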
https://arxiv.org/abs/2506.09784
Micro-expression recognition (MER), a critical subfield of affective computing, presents greater challenges than macro-expression recognition due to its brief duration and low intensity. While incorporating prior knowledge has been shown to enhance MER performance, existing methods predominantly rely on simplistic, singular sources of prior knowledge, failing to fully exploit multi-source information. This paper introduces the Multi-Prior Fusion Network (MPFNet), leveraging a progressive training strategy to optimize MER tasks. We propose two complementary encoders: the Generic Feature Encoder (GFE) and the Advanced Feature Encoder (AFE), both based on Inflated 3D ConvNets (I3D) with Coordinate Attention (CA) mechanisms, to improve the model's ability to capture spatiotemporal and channel-specific features. Inspired by developmental psychology, we present two variants of MPFNet--MPFNet-P and MPFNet-C--corresponding to two fundamental modes of infant cognitive development: parallel and hierarchical processing. These variants enable the evaluation of different strategies for integrating prior knowledge. Extensive experiments demonstrate that MPFNet significantly improves MER accuracy while maintaining balanced performance across categories, achieving accuracies of 0.811, 0.924, and 0.857 on the SMIC, CASME II, and SAMM datasets, respectively. To the best of our knowledge, our approach achieves state-of-the-art performance on the SMIC and SAMM datasets.
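The two variants are easiest to see as wiring diagrams; the encoders below are one-layer stand-ins (the real GFE and AFE are I3D networks with coordinate attention), and the exact hierarchical wiring of MPFNet-C is our assumption.

    import torch
    import torch.nn as nn

    gfe = nn.Linear(128, 64)            # stand-in Generic Feature Encoder
    afe = nn.Linear(128, 64)            # stand-in Advanced Feature Encoder
    afe_refine = nn.Linear(64, 64)      # stand-in AFE applied on top of GFE output
    head_p, head_c = nn.Linear(128, 3), nn.Linear(64, 3)   # toy classification heads

    x = torch.randn(4, 128)             # clip-level features (toy stand-in for video input)

    parallel = torch.cat([gfe(x), afe(x)], dim=-1)   # MPFNet-P: independent streams, fused late
    hierarchical = afe_refine(gfe(x))                # MPFNet-C: advanced encoder refines generic features
    logits_p, logits_c = head_p(parallel), head_c(hierarchical)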
https://arxiv.org/abs/2506.09735
Accurate 6D pose estimation of complex objects in 3D environments is essential for effective robotic manipulation. Yet, existing benchmarks fall short in evaluating 6D pose estimation methods under realistic industrial conditions, as most datasets focus on household objects in domestic settings, while the few available industrial datasets are limited to artificial setups with objects placed on tables. To bridge this gap, we introduce CHIP, the first dataset designed for 6D pose estimation of chairs manipulated by a robotic arm in a real-world industrial environment. CHIP includes seven distinct chairs captured using three different RGBD sensing technologies and presents unique challenges, such as distractor objects with fine-grained differences and severe occlusions caused by the robotic arm and human operators. CHIP comprises 77,811 RGBD images annotated with ground-truth 6D poses automatically derived from the robot's kinematics, averaging 11,115 annotations per chair. We benchmark CHIP using three zero-shot 6D pose estimation methods, assessing performance across different sensor types, localization priors, and occlusion levels. Results show substantial room for improvement, highlighting the unique challenges posed by the dataset. CHIP will be publicly released.
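The automatic annotation boils down to composing known transforms once the chair is rigidly grasped; the frame names and the fixed grasp offset below are illustrative, not taken from the dataset.

    import numpy as np

    T_world_ee = np.eye(4)                     # end-effector pose from the robot's forward kinematics
    T_world_ee[:3, 3] = [0.6, 0.1, 0.9]
    T_ee_chair = np.eye(4)                     # calibrated, fixed grasp offset (assumed)
    T_ee_chair[:3, 3] = [0.0, 0.0, 0.15]

    T_world_chair = T_world_ee @ T_ee_chair    # 6D ground-truth pose for the current frame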
https://arxiv.org/abs/2506.09699
Early diagnosis of Alzheimer's Disease (AD), especially at the mild cognitive impairment (MCI) stage, is vital yet hindered by subjective assessments and the high cost of multimodal imaging modalities. Although deep learning methods offer automated alternatives, their energy inefficiency and computational demands limit real-world deployment, particularly in resource-constrained settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are inherently well-suited for modeling the sparse, event-driven patterns of neural degeneration in AD, offering a promising foundation for interpretable and low-power medical diagnostics. However, existing SNNs often suffer from weak expressiveness and unstable training, which restrict their effectiveness in complex medical tasks. To address these limitations, we propose FasterSNN, a hybrid neural architecture that integrates biologically inspired LIF neurons with region-adaptive convolution and multi-scale spiking attention. This design enables sparse, efficient processing of 3D MRI while preserving diagnostic accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening. Our source code is available at this https URL.
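For readers new to SNNs, the leaky integrate-and-fire dynamics underlying models like FasterSNN are compact enough to show directly; the discrete-time form below is generic, and the paper's region-adaptive convolutions, spiking attention, and training details are not represented.

    import numpy as np

    def lif_run(inputs, decay=0.9, threshold=1.0):
        v, spikes = 0.0, []
        for i in inputs:
            v = decay * v + i             # leaky integration of the input current
            s = float(v >= threshold)     # emit a binary spike when the threshold is crossed
            v = v * (1.0 - s)             # hard reset after a spike
            spikes.append(s)
        return np.array(spikes)

    print(lif_run(np.array([0.3, 0.4, 0.5, 0.1, 0.9, 0.2])))   # sparse, event-driven output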
https://arxiv.org/abs/2506.09695
We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high-quality materials for 3D models given a text prompt or a single image. First, we condition a video diffusion model to respect the input geometry and lighting condition. This model produces multiple views of a given 3D model with coherent material properties. Second, we use a recent model to extract intrinsics (base color, roughness, metallic) from the generated video. Finally, we use the intrinsics alongside the generated video in a differentiable path tracer to robustly extract PBR materials directly compatible with common content creation tools.
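The final stage is an inverse-rendering optimization; in the sketch below the differentiable path tracer is replaced by a trivial stand-in function (a real pipeline would use a system such as Mitsuba 3), and the target frames and intrinsics would come from the earlier video-generation and decomposition stages.

    import torch

    material = {k: torch.full((256, 256, c), 0.5, requires_grad=True)
                for k, c in [("base_color", 3), ("roughness", 1), ("metallic", 1)]}
    opt = torch.optim.Adam(material.values(), lr=0.01)

    def render(mat, view):                  # stand-in for a differentiable path tracer
        return mat["base_color"] * (1 - mat["roughness"]) + 0.1 * mat["metallic"]

    target_frames = [torch.rand(256, 256, 3) for _ in range(4)]   # frames from the generated video

    for step in range(100):
        loss = sum(((render(material, v) - f) ** 2).mean() for v, f in enumerate(target_frames))
        opt.zero_grad(); loss.backward(); opt.step()              # gradients flow into the PBR maps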
https://arxiv.org/abs/2506.09665
Articulated objects are ubiquitous in everyday life, and accurate 3D representations of their geometry and motion are critical for numerous applications. However, in the absence of human annotation, existing approaches still struggle to build a unified representation for objects that contain multiple movable parts. We introduce DeGSS, a unified framework that encodes articulated objects as deformable 3D Gaussian fields, embedding geometry, appearance, and motion in one compact representation. Each interaction state is modeled as a smooth deformation of a shared field, and the resulting deformation trajectories guide a progressive coarse-to-fine part segmentation that identifies distinct rigid components, all in an unsupervised manner. The refined field provides a spatially continuous, fully decoupled description of every part, supporting part-level reconstruction and precise modeling of their kinematic relationships. To evaluate generalization and realism, we enlarge the synthetic PartNet-Mobility benchmark and release RS-Art, a real-to-sim dataset that pairs RGB captures with accurately reverse-engineered 3D models. Extensive experiments demonstrate that our method outperforms existing methods in both accuracy and stability.
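A heavily simplified version of the part-discovery idea: Gaussians whose deformation trajectories move together across interaction states are grouped into the same rigid part. The clustering below is a plain k-means stand-in for the paper's progressive coarse-to-fine segmentation, with sizes and the number of parts chosen arbitrarily.

    import numpy as np
    from sklearn.cluster import KMeans

    N, S = 5000, 6                                     # Gaussians, interaction states (toy)
    centers = np.random.rand(N, S, 3)                  # per-state Gaussian positions

    traj = (centers - centers[:, :1]).reshape(N, -1)   # displacement trajectories relative to state 0
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(traj)
    # `labels` assigns each Gaussian to a candidate rigid part; a refinement pass could fit
    # per-part rigid transforms and reassign outliers.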
https://arxiv.org/abs/2506.09663
Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage alignment with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient multimodal projector that condenses high-resolution 3D spatial regions into a compact set of informative visual tokens via centroid-based compression. By pairing spatial packers with the dual-3D vision encoders, HSENet can seamlessly perceive and transfer hybrid visual representations to LLM's semantic space, facilitating accurate diagnostic text generation. Experimental results demonstrate that our method achieves state-of-the-art performance in 3D language-visual retrieval (39.85% R@100, a +5.96% gain), 3D medical report generation (24.01% BLEU-4, a +8.01% gain), and 3D visual question answering (73.60% Major Class Accuracy, a +1.99% gain), confirming its effectiveness. Our code is available at this https URL.
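The Spatial Packer is learned, but "centroid-based compression" can be illustrated with a plain clustering pass: dense 3D patch embeddings are condensed into a fixed token budget by replacing each cluster with its centroid. Sizes below are arbitrary and the real projector differs.

    import numpy as np
    from sklearn.cluster import KMeans

    patch_feats = np.random.rand(4096, 768)            # dense 3D patch embeddings from a CT volume (toy)
    K = 64                                             # visual-token budget for the LLM

    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(patch_feats)
    tokens = km.cluster_centers_                       # (64, 768): one compact token per centroid
    print(patch_feats.shape, "->", tokens.shape)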
https://arxiv.org/abs/2506.09634
Localization plays a crucial role in the navigation capabilities of autonomous robots, and while indoor environments can rely on wheel odometry and 2D LiDAR-based mapping, outdoor settings such as agriculture and forestry present unique challenges that necessitate real-time localization and consistent mapping. Addressing this need, this paper introduces the VAULT prototype, a ROS 2-based mobile mapping system (MMS) that combines various sensors to enable robust outdoor and indoor localization. The proposed solution harnesses the power of Global Navigation Satellite System (GNSS) data, visual-inertial odometry (VIO), inertial measurement unit (IMU) data, and the Extended Kalman Filter (EKF) to generate reliable 3D odometry. To further enhance the localization accuracy, Visual SLAM (VSLAM) is employed, resulting in the creation of a comprehensive 3D point cloud map. By leveraging these sensor technologies and advanced algorithms, the prototype offers a comprehensive solution for outdoor localization in autonomous mobile robots, enabling them to navigate and map their surroundings with confidence and precision.
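The fusion idea reduces to a predict/update cycle; the sketch below is a linear Kalman step on a 2D position state, whereas the actual VAULT EKF fuses GNSS, VIO, and IMU over a full 3D pose.

    import numpy as np

    x, P = np.zeros(2), np.eye(2)        # state [x, y] and its covariance
    Q = 0.01 * np.eye(2)                 # process noise (odometry drift)
    R = 0.25 * np.eye(2)                 # measurement noise (GNSS)

    def predict(x, P, odom_delta):
        return x + odom_delta, P + Q                   # propagate with the odometry increment

    def update(x, P, gnss_xy):
        K = P @ np.linalg.inv(P + R)                   # Kalman gain (measurement model H = I)
        return x + K @ (gnss_xy - x), (np.eye(2) - K) @ P

    x, P = predict(x, P, np.array([1.0, 0.2]))         # VIO/wheel odometry step
    x, P = update(x, P, np.array([0.9, 0.25]))         # GNSS correction step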
https://arxiv.org/abs/2506.09583
Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at this https URL.
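The cost-volume ingredient can be sketched compactly: for every depth hypothesis, source-view features are warped into the reference view and correlated with the reference features. The warping is faked with random tensors below; a real implementation would reproject using camera geometry, and the fusion with LSeg/SAM feature fields is not shown.

    import torch
    import torch.nn.functional as F

    C, H, W, D = 32, 64, 64, 16                             # channels, image size, depth hypotheses
    ref = F.normalize(torch.randn(C, H, W), dim=0)          # reference-view features
    warped = F.normalize(torch.randn(D, C, H, W), dim=1)    # source features "warped" per hypothesis (stand-in)

    cost_volume = (ref.unsqueeze(0) * warped).sum(dim=1)    # (D, H, W) cross-view feature similarities
    best_depth = cost_volume.argmax(dim=0)                  # per-pixel most consistent hypothesis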
https://arxiv.org/abs/2506.09565