We embark on the age-old quest: unveiling the hidden dimensions of objects from mere glimpses of their visible parts. To address this, we present Vista3D, a framework that realizes swift and consistent 3D generation within a mere 5 minutes. At the heart of Vista3D lies a two-phase approach: the coarse phase and the fine phase. In the coarse phase, we rapidly generate initial geometry with Gaussian Splatting from a single image. In the fine phase, we extract a Signed Distance Function (SDF) directly from the learned Gaussian Splatting and optimize it with a differentiable isosurface representation. Furthermore, Vista3D elevates the quality of generation by using a disentangled representation with two independent implicit functions to capture both the visible and obscured aspects of objects. Additionally, it harmonizes gradients from a 2D diffusion prior with those from 3D-aware diffusion priors through angular diffusion prior composition. Through extensive evaluation, we demonstrate that Vista3D effectively sustains a balance between the consistency and diversity of the generated 3D objects. Demos and code will be available at this https URL.
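As a rough illustration of how two diffusion priors might be blended by camera angle, here is a minimal PyTorch sketch; the cosine weighting and the function name are assumptions made for exposition, not Vista3D's actual composition rule.

```python
import math
import torch

def compose_prior_gradients(grad_2d: torch.Tensor,
                            grad_3d: torch.Tensor,
                            azimuth: float) -> torch.Tensor:
    """Blend a 2D diffusion gradient with a 3D-aware diffusion gradient.

    Hypothetical weighting: trust the 2D prior most near the reference (front)
    view and the 3D-aware prior most at the back view. `azimuth` is the camera
    azimuth in radians relative to the input view.
    """
    w = 0.5 * (1.0 + math.cos(azimuth))   # 1 at the front view, 0 at the back view
    return w * grad_2d + (1.0 - w) * grad_3d

# toy usage with random tensors standing in for diffusion score gradients
g2d, g3d = torch.randn(3, 64, 64), torch.randn(3, 64, 64)
blended = compose_prior_gradients(g2d, g3d, azimuth=math.pi / 3)
```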
https://arxiv.org/abs/2409.12193
Bundle adjustment (BA) is a critical technique in various robotic applications, such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA frameworks, such as GTSAM, g$^2$o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA framework seamlessly integrated with PyPose, providing PyTorch-compatible interfaces with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ compared to GTSAM, g$^2$o, and Ceres, respectively.
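To make the reprojection-error objective behind BA concrete, the following self-contained PyTorch sketch optimizes one camera's pose against synthetic observations with a first-order optimizer. It only illustrates the eager-mode idea of differentiating through BA residuals; it is not the PyPose interface, its sparse second-order solvers, or the paper's implementation.

```python
import torch

def rodrigues(rvec: torch.Tensor) -> torch.Tensor:
    """Axis-angle vector (3,) -> rotation matrix (3, 3), differentiable."""
    theta = torch.sqrt(rvec.pow(2).sum() + 1e-12)
    k = rvec / theta
    zero = torch.zeros((), dtype=rvec.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    return torch.eye(3, dtype=rvec.dtype) + torch.sin(theta) * K \
        + (1 - torch.cos(theta)) * (K @ K)

def residuals(rvec, tvec, pts3d, obs2d, f=500.0, c=320.0):
    """Pinhole reprojection residuals for one camera observing N landmarks."""
    p_cam = pts3d @ rodrigues(rvec).T + tvec        # world -> camera frame
    uv = f * p_cam[:, :2] / p_cam[:, 2:3] + c       # project to pixels
    return (uv - obs2d).reshape(-1)

# Synthetic problem: recover a small pose offset from 2D observations.
torch.manual_seed(0)
pts3d = torch.randn(50, 3) + torch.tensor([0.0, 0.0, 5.0])
obs2d = residuals(torch.tensor([0.05, -0.02, 0.01]),
                  torch.tensor([0.10, -0.10, 0.20]),
                  pts3d, torch.zeros(50, 2)).reshape(50, 2)

rvec = torch.zeros(3, requires_grad=True)
tvec = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([rvec, tvec], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    residuals(rvec, tvec, pts3d, obs2d).pow(2).mean().backward()
    opt.step()
```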
https://arxiv.org/abs/2409.12190
Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information about the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions between widely varying numbers of both people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at this https URL.
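A toy PyTorch sketch of the general "temporal convolutions around a Transformer bottleneck" pattern described above; the layer sizes, the 66-D pose input, and the purely temporal token layout are placeholders, not the actual SAST configuration (which also mixes person and scene tokens and is trained as a denoising diffusion model).

```python
import torch
import torch.nn as nn

class TemporalConvTransformer(nn.Module):
    """Illustrative encoder-bottleneck-decoder: 1D temporal convs around a Transformer.

    Input: (batch, channels, time) motion features.
    """
    def __init__(self, in_ch=66, hidden=128, heads=4, layers=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_ch, hidden, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.bottleneck = nn.TransformerEncoder(block, num_layers=layers)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hidden, hidden, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose1d(hidden, in_ch, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):                       # x: (B, C, T)
        z = self.encoder(x)                     # (B, H, T/4)
        z = self.bottleneck(z.transpose(1, 2))  # tokens over time: (B, T/4, H)
        return self.decoder(z.transpose(1, 2))  # back to (B, C, T)

x = torch.randn(2, 66, 100)   # 2 sequences, 66-D pose, 100 frames
y = TemporalConvTransformer()(x)
```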
https://arxiv.org/abs/2409.12189
Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at this https URL.
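The depth maps mentioned above are computed from LiDAR point clouds; a standard way to do this (a sketch, not necessarily the authors' exact pipeline) is to project the points into the image plane with the camera intrinsics and keep the nearest return per pixel. The intrinsics below are illustrative values in a KITTI-like range.

```python
import numpy as np

def lidar_to_depth_map(points_cam: np.ndarray, K: np.ndarray,
                       h: int, w: int) -> np.ndarray:
    """Project 3D points (already in camera coordinates, N x 3) to a sparse
    depth map using pinhole intrinsics K (3 x 3). Keeps the nearest depth
    when several points fall on the same pixel."""
    z = points_cam[:, 2]
    valid = z > 0
    pts, z = points_cam[valid], z[valid]
    uv = (K @ pts.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.full((h, w), np.inf)
    order = np.argsort(-z)            # write far points first, near points last
    depth[v[order], u[order]] = z[order]
    depth[np.isinf(depth)] = 0.0      # 0 marks pixels with no LiDAR return
    return depth

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])
pts = np.random.uniform([-10, -2, 1], [10, 2, 60], size=(5000, 3))
depth = lidar_to_depth_map(pts, K, h=376, w=1241)
```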
https://arxiv.org/abs/2409.12008
We present in this paper a novel approach for 3D/2D intraoperative registration during neurosurgery via cross-modal inverse neural rendering. Our approach separates the implicit neural representation into two components, handling anatomical structure preoperatively and appearance intraoperatively. This disentanglement is achieved by controlling a Neural Radiance Field's appearance with a multi-style hypernetwork. Once trained, the implicit neural representation serves as a differentiable rendering engine, which can be used to estimate the surgical camera pose by minimizing the dissimilarity between its rendered images and the target intraoperative image. We tested our method on retrospective patients' data from clinical cases, showing that it outperforms the state of the art while meeting current clinical standards for registration. Code and additional resources can be found at this https URL.
https://arxiv.org/abs/2409.11983
Representing the 3D environment with instance-aware semantic and geometric information is crucial for interaction-aware robots in dynamic environments. Nonetheless, creating such a representation poses challenges due to sensor noise, instance segmentation and tracking errors, and the objects' dynamic motion. This paper introduces a novel particle-based instance-aware semantic occupancy map to tackle these challenges. Particles with an augmented instance state are used to estimate the Probability Hypothesis Density (PHD) of the objects and implicitly model the environment. Utilizing a State-augmented Sequential Monte Carlo PHD (S$^2$MC-PHD) filter, these particles are updated to jointly estimate occupancy status, semantic, and instance IDs, mitigating noise. Additionally, a memory module is adopted to enhance the map's responsiveness to previously observed objects. Experimental results on the Virtual KITTI 2 dataset demonstrate that the proposed approach surpasses state-of-the-art methods across multiple metrics under different noise conditions. Subsequent tests using real-world data further validate the effectiveness of the proposed approach.
https://arxiv.org/abs/2409.11975
Recent studies suggest a potential link between the physical structure of mitochondria and neurodegenerative diseases. With advances in Electron Microscopy techniques, it has become possible to visualize the boundary and internal membrane structures of mitochondria in detail. It is crucial to automatically segment mitochondria from these images to investigate the relationship between mitochondria and diseases. In this paper, we present a software solution for mitochondrial segmentation, highlighting mitochondria boundaries in electron microscopy tomography images and generating corresponding 3D meshes.
https://arxiv.org/abs/2409.11974
Real-time rendering of human head avatars is a cornerstone of many computer graphics applications, such as augmented reality, video games, and films, to name a few. Recent approaches address this challenge with computationally efficient geometry primitives in a carefully calibrated multi-view setup. While producing photorealistic head renderings, these approaches often fail to represent complex motion changes such as the mouth interior and strongly varying head poses. We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real time. At the core of our method is a hierarchical representation of head models that allows us to capture the complex dynamics of facial expressions and head movements. First, with rich facial features extracted from raw input frames, we learn to deform the coarse facial geometry of the template mesh. We then initialize 3D Gaussians on the deformed surface and refine their positions in a fine step. We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework. This enables not only controllable facial animation via video inputs, but also high-fidelity novel view synthesis of challenging facial expressions, such as tongue deformations and fine-grained teeth structure under large motion changes. Moreover, it encourages the learned head avatar to generalize to new facial expressions and head poses at inference time. We demonstrate the performance of our method with comparisons against related methods on different datasets, spanning challenging facial expression sequences across multiple identities. We also show the potential application of our approach by demonstrating a cross-identity facial performance transfer application.
https://arxiv.org/abs/2409.11951
When your robot grasps an object using dexterous hands or grippers, it should understand the Task-Oriented Affordances of the Object (TOAO), as different tasks often require attention to specific parts of the object. To address this challenge, we propose GauTOAO, a Gaussian-based framework for Task-Oriented Affordance of Objects, which leverages vision-language models in a zero-shot manner to predict affordance-relevant regions of an object given a natural language query. Our approach introduces a new paradigm: "static camera, moving object," allowing the robot to better observe and understand the object in hand during manipulation. GauTOAO addresses the limitations of existing methods, which often lack effective spatial grouping, by extracting a comprehensive 3D object mask using DINO features. This mask is then used to conditionally query Gaussians, producing a refined semantic distribution over the object for the specified task. This approach results in more accurate TOAO extraction, enhancing the robot's understanding of the object and improving task performance. We validate the effectiveness of GauTOAO through real-world experiments, demonstrating its capability to generalize across various tasks.
https://arxiv.org/abs/2409.11941
Tooth arrangement is an essential step in the digital orthodontic planning process. Existing learning-based methods use hidden teeth features to directly regress teeth motions, which couples target pose perception with motion regression and can lead to poor perception of the three-dimensional transformation. They also ignore the possible overlaps or gaps between teeth of the predicted dentition, which is generally unacceptable. Therefore, we propose DTAN, a differentiable collision-supervised tooth arrangement network that decouples prediction tasks from feature modeling. DTAN decouples the tooth arrangement task by first predicting the hidden features of the final teeth poses and then using them to assist in regressing the motions between the beginning and target teeth. To learn the hidden features better, DTAN also decouples the teeth-hidden features into geometric and positional features, which are further supervised by feature consistency constraints. Furthermore, we propose a novel differentiable collision loss function for point cloud data to constrain the related gestures between teeth, which can be easily extended to other 3D point cloud tasks. We also propose an arch-width guided tooth arrangement network, named C-DTAN, to make the results controllable. We construct three different tooth arrangement datasets and achieve drastically improved performance in accuracy and speed compared with existing methods.
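To make the idea of differentiable collision supervision concrete, here is a simplified proxy in PyTorch: penalize point pairs from two neighboring teeth that come closer than a margin, so gradients push overlapping geometry apart. This is only an illustrative stand-in, not DTAN's actual collision loss.

```python
import torch

def collision_loss(teeth_a: torch.Tensor, teeth_b: torch.Tensor,
                   margin: float = 0.2) -> torch.Tensor:
    """Simplified differentiable collision proxy between two tooth point clouds
    (N x 3 and M x 3): quadratically penalize point pairs from different teeth
    that come closer than `margin`, so gradients stay smooth."""
    d = torch.cdist(teeth_a, teeth_b)              # (N, M) pairwise distances
    return torch.relu(margin - d).pow(2).mean()

a = torch.randn(200, 3, requires_grad=True)
b = torch.randn(200, 3) + torch.tensor([0.1, 0.0, 0.0])
loss = collision_loss(a, b)
loss.backward()   # gradients push interpenetrating points apart
```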
https://arxiv.org/abs/2409.11937
In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state of the art.
https://arxiv.org/abs/2409.11920
Background: Voxel-based analysis (VBA) for population-level radiotherapy (RT) outcomes modeling requires topology-preserving inter-patient deformable image registration (DIR) that preserves tumors on moving images while avoiding unrealistic deformations due to tumors occurring on fixed images. Purpose: We developed a tumor-aware recurrent registration (TRACER) deep learning (DL) method and evaluated its suitability for VBA. Methods: TRACER consists of encoder layers implemented with a stacked 3D convolutional long short-term memory network (3D-CLSTM) followed by decoder and spatial transform layers to compute a dense deformation vector field (DVF). Multiple CLSTM steps are used to compute a progressive sequence of deformations. Input conditioning was applied by including tumor segmentations with 3D image pairs as input channels. Bidirectional tumor rigidity, image similarity, and deformation smoothness losses were used to optimize the network in an unsupervised manner. TRACER and multiple DL methods were trained with 204 3D CT image pairs from patients with lung cancers (LC) and evaluated using (a) Dataset I (N = 308 pairs) with DL-segmented LCs, (b) Dataset II (N = 765 pairs) with manually delineated LCs, and (c) Dataset III with 42 LC patients treated with RT. Results: TRACER accurately aligned normal tissues. It best preserved tumors, indicated by the smallest tumor volume differences of 0.24%, 0.40%, and 0.13% and mean square errors in CT intensities of 0.005, 0.005, and 0.004, computed between original and resampled moving image tumors, for Datasets I, II, and III, respectively. It also resulted in the smallest planned RT tumor dose difference between original and resampled moving images: 0.01 Gy and 0.013 Gy when using a female and a male reference, respectively.
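For readers unfamiliar with unsupervised DIR objectives, below is a minimal sketch of two of the three loss terms (image similarity and DVF smoothness) in PyTorch. The tensor shapes, the loss weight, and the omission of the bidirectional tumor-rigidity term are simplifications, not TRACER's actual formulation.

```python
import torch
import torch.nn.functional as F

def registration_loss(warped: torch.Tensor, fixed: torch.Tensor,
                      dvf: torch.Tensor, lambda_smooth: float = 0.1):
    """Toy unsupervised DIR objective: similarity between the warped moving
    image and the fixed image, plus a first-order smoothness penalty on the
    dense deformation vector field dvf of shape (B, 3, D, H, W)."""
    similarity = F.mse_loss(warped, fixed)
    dz = (dvf[:, :, 1:, :, :] - dvf[:, :, :-1, :, :]).pow(2).mean()
    dy = (dvf[:, :, :, 1:, :] - dvf[:, :, :, :-1, :]).pow(2).mean()
    dx = (dvf[:, :, :, :, 1:] - dvf[:, :, :, :, :-1]).pow(2).mean()
    return similarity + lambda_smooth * (dx + dy + dz)

warped = torch.rand(1, 1, 32, 64, 64)
fixed = torch.rand(1, 1, 32, 64, 64)
dvf = torch.zeros(1, 3, 32, 64, 64, requires_grad=True)
loss = registration_loss(warped, fixed, dvf)
```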
https://arxiv.org/abs/2409.11910
Gait recognition is a rapidly progressing technique for the remote identification of individuals. Prior research, predominantly employing 2D sensors to gather gait data, has achieved notable advancements; nonetheless, it has unavoidably neglected the influence of 3D dynamic characteristics on recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly captures 3D spatial features but also diminishes the impact of lighting conditions while ensuring privacy protection. The essence of the problem lies in how to effectively extract discriminative 3D dynamic representations from point clouds. In this paper, we propose a method named SpheriGait for extracting and enhancing dynamic features from point clouds for LiDAR-based gait recognition. Specifically, it substitutes the conventional point cloud plane projection with spherical projection to augment the perception of dynamic features. Additionally, a network block named DAM-L is proposed to extract gait cues from the projected point cloud data. We conducted extensive experiments, and the results demonstrate that SpheriGait achieves state-of-the-art performance on the SUSTech1K dataset and verify that the spherical projection method can serve as a universal data preprocessing technique to enhance the performance of other LiDAR-based gait recognition methods, exhibiting exceptional flexibility and practicality.
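Spherical projection of a LiDAR sweep is a standard operation; the sketch below shows one common formulation (a range image indexed by azimuth and elevation). The resolution and field-of-view values are illustrative and are not claimed to match SpheriGait's settings.

```python
import numpy as np

def spherical_projection(points: np.ndarray, h: int = 64, w: int = 512,
                         fov_up: float = 15.0, fov_down: float = -25.0):
    """Project an (N, 3) LiDAR point cloud onto an h x w range image using
    spherical coordinates (azimuth, elevation). FOV values are illustrative."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    azimuth = np.arctan2(y, x)                   # [-pi, pi]
    elevation = np.arcsin(z / r)                 # radians
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = ((1.0 - (azimuth + np.pi) / (2 * np.pi)) * w).astype(int)
    v = ((fov_up_r - elevation) / (fov_up_r - fov_down_r) * h).astype(int)
    u, v = np.clip(u, 0, w - 1), np.clip(v, 0, h - 1)
    image = np.zeros((h, w))
    image[v, u] = r                              # store range per pixel
    return image

pts = np.random.uniform(-20, 20, size=(10000, 3))
range_image = spherical_projection(pts)
```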
https://arxiv.org/abs/2409.11869
Photometric bundle adjustment (PBA) is widely used for estimating camera poses and 3D geometry by assuming a Lambertian world. However, the assumption of photometric consistency is often violated, since non-diffuse reflection is common in real-world environments. This photometric inconsistency significantly affects the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce physically-based weights regarding material, illumination, and light path. These weights distinguish the pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation based on sequential images and illumination estimation based on point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth of illumination and material. Extensive experiments demonstrate that our PBA method outperforms existing approaches in accuracy.
https://arxiv.org/abs/2409.11854
In material physics, characterization techniques are crucial for obtaining materials data regarding physical properties as well as structural, electronic, magnetic, optical, dielectric, and spectroscopic characteristics. However, for many materials, availability and safe accessibility are not always easy to ensure or fully guaranteed. Moreover, modeling and simulation techniques require substantial theoretical knowledge, in addition to being associated with costly computation time and considerable complexity. Thus, analyzing multiple samples with different techniques simultaneously remains very challenging for engineers and researchers. It is worth noting that, although it carries risks, X-ray diffraction is a well-known and widely used characterization technique that gathers data on the structural properties of crystalline 1D, 2D, or 3D materials. In this paper, we propose a Smart GRU (Gated Recurrent Unit) model to forecast the structural characteristics or properties of thin films of tin oxide SnO$_2$(110). Thin film samples are fabricated and handled experimentally, and the collected data dictionary is then used to generate an AI (Artificial Intelligence) GRU model for characterizing the structural properties of SnO$_2$(110) thin films.
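As a purely illustrative sketch of the kind of GRU regressor the abstract describes, here is a minimal PyTorch model mapping a feature sequence to a scalar property prediction; the input features, layer sizes, and output dimension are placeholders, since the paper's data dictionary and model configuration are not specified here.

```python
import torch
import torch.nn as nn

class PropertyGRU(nn.Module):
    """Minimal GRU regressor: maps a sequence of process/measurement features
    to a predicted structural property value. Sizes are placeholders."""
    def __init__(self, in_dim=8, hidden=32, out_dim=1):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x):          # x: (batch, time, in_dim)
        _, h = self.gru(x)         # h: (1, batch, hidden), last hidden state
        return self.head(h[-1])    # (batch, out_dim)

model = PropertyGRU()
x = torch.randn(4, 20, 8)          # 4 samples, 20 time steps, 8 features
y_hat = model(x)
```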
https://arxiv.org/abs/2409.11782
3D Multi-Object Tracking (MOT) obtains significant performance improvements with the rapid advancements in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique features of multi-camera detectors, i.e., the unreliability of motion observations and the feasibility of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from distinct representation spaces from a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.
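The abstract does not spell out its appearance similarity metric, so the snippet below shows only a generic appearance-based association baseline (cosine similarity over embeddings plus Hungarian matching) for context; it is not RockTrack's proposed metric or association module.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats: np.ndarray, det_feats: np.ndarray, thresh: float = 0.3):
    """Toy appearance association: cosine similarity between L2-normalized
    track and detection embeddings, solved with the Hungarian algorithm."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    sim = t @ d.T                               # (num_tracks, num_dets)
    rows, cols = linear_sum_assignment(-sim)    # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] > thresh]

matches = associate(np.random.randn(5, 128), np.random.randn(7, 128))
```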
https://arxiv.org/abs/2409.11749
The Light-Field (LF) image is an emerging form of 4D data over light rays that can realistically present the spatial and angular information of a 3D scene. However, the large data volume of LF images is the most challenging issue in real-time processing, transmission, and storage. In this paper, we propose an end-to-end deep LF Image Compression method Using Disentangled Representation and Asymmetrical Strip Convolution (LFIC-DRASC) to improve coding efficiency. Firstly, we formulate the LF image compression problem as learning a disentangled LF representation network and an image encoding-decoding network. Secondly, we propose two novel feature extractors that leverage the structural prior of LF data by integrating features across different dimensions. Meanwhile, a disentangled LF representation network is proposed to enhance LF feature disentangling and decoupling. Thirdly, we propose LFIC-DRASC for LF image compression, where two Asymmetrical Strip Convolution (ASC) operators, i.e., horizontal and vertical, are proposed to capture long-range correlation in the LF feature space. These two ASC operators can be combined with square convolution to further decouple LF features, which enhances the model's ability to represent intricate spatial relationships. Experimental results demonstrate that the proposed LFIC-DRASC achieves an average bit rate reduction of 20.5% compared with state-of-the-art methods.
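Below is a minimal PyTorch sketch of asymmetric strip convolutions combined with a square convolution, as described above; the channel counts, kernel size, and additive fusion are placeholders rather than the LFIC-DRASC architecture.

```python
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    """Illustrative block combining horizontal (1 x k) and vertical (k x 1)
    strip convolutions with a square convolution."""
    def __init__(self, channels=64, k=7):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.square = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        # strip convs capture long-range row/column correlations; the square
        # conv keeps local spatial context; all preserve the feature-map size
        return self.act(self.horizontal(x) + self.vertical(x) + self.square(x))

x = torch.randn(1, 64, 32, 32)
y = StripConvBlock()(x)            # same spatial size: (1, 64, 32, 32)
```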
https://arxiv.org/abs/2409.11711
A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the monocular SLAM. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking purposes, and the geometric prior of the 3D shape is incorporated as an additional constraint in the pose graph. Experiments from in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and "organ-background" relative motion.
https://arxiv.org/abs/2409.11688
In this paper, we propose SRIF, a novel Semantic shape Registration framework based on diffusion-based Image morphing and Flow estimation. More concretely, given a pair of extrinsically aligned shapes, we first render them from multiple views, and then utilize an image interpolation framework based on diffusion models to generate sequences of intermediate images between them. The images are later fed into a dynamic 3D Gaussian splatting framework, with which we reconstruct and post-process intermediate point clouds that respect the image morphing process. Finally, tailored to the above, we propose a novel registration module to estimate a continuous normalizing flow, which deforms the source shape consistently towards the target, with the intermediate point clouds as weak guidance. Our key insight is to leverage large vision models (LVMs) to associate shapes, thereby obtaining much richer semantic information on the relationship between shapes than ad-hoc feature extraction and alignment. As a consequence, SRIF not only achieves high-quality dense correspondences on challenging shape pairs but also delivers smooth, semantically meaningful interpolation in between. Empirical evidence justifies the effectiveness and superiority of our method as well as specific design choices. The code is released at this https URL.
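To illustrate how intermediate point clouds can act as weak guidance for deformation, here is a toy PyTorch example that moves a source cloud toward one intermediate cloud by minimizing a Chamfer distance over per-point offsets. SRIF instead learns a continuous normalizing flow, so this is only a conceptual stand-in for the guidance term.

```python
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets (N, 3) and (M, 3)."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# toy: displace the source toward an intermediate cloud under Chamfer guidance
source = torch.randn(500, 3)
intermediate = source + 0.3                      # stand-in for a reconstructed state
offsets = torch.zeros_like(source, requires_grad=True)
opt = torch.optim.Adam([offsets], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = chamfer(source + offsets, intermediate)
    loss.backward()
    opt.step()
```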
https://arxiv.org/abs/2409.11682
3D Gaussian Splatting has emerged as a powerful 3D scene representation technique, capturing fine details with high efficiency. In this paper, we introduce a novel voting-based method that extends 2D segmentation models to 3D Gaussian splats. Our approach leverages masked gradients, where gradients are filtered by input 2D masks, and these gradients are used as votes to achieve accurate segmentation. As a byproduct, we discovered that inference-time gradients can also be used to prune Gaussians, resulting in up to 21% compression. Additionally, we explore few-shot affordance transfer, allowing annotations from 2D images to be effectively transferred onto 3D Gaussian splats. The robust yet straightforward mathematical formulation underlying this approach makes it a highly effective tool for numerous downstream applications, such as augmented reality (AR), object editing, and robotics. The project code and additional resources are available at this https URL.
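A toy numerical illustration of the voting idea (gradient magnitudes under a masked loss serving as per-Gaussian votes); the "render" here is a fake weighted sum, so this only mimics the mechanism and is not an actual 3D Gaussian splatting pipeline or the paper's implementation.

```python
import torch

# Stand-in "Gaussians" contribute to an image through fixed toy weights; the
# 2D mask selects which pixels back-propagate, and the per-Gaussian gradient
# magnitude acts as its vote for belonging to the masked region.
num_gaussians, h, w = 100, 32, 32
contribution = torch.rand(num_gaussians, h, w)          # fake per-Gaussian render weights
opacity = torch.rand(num_gaussians, requires_grad=True)

image = (opacity[:, None, None] * contribution).sum(dim=0)   # fake "render"
mask = torch.zeros(h, w)
mask[8:24, 8:24] = 1.0                                        # input 2D mask

(image * mask).sum().backward()                               # masked gradients
votes = opacity.grad.abs()                                    # one vote per Gaussian
selected = votes > votes.mean()                               # crude in/out decision
```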
https://arxiv.org/abs/2409.11681