Inventory monitoring in homes, factories, and retail stores relies on maintaining data despite objects being swapped, added, removed, or moved. We introduce Lifelong LERF, a method that allows a mobile robot with minimal compute to jointly optimize a dense language and geometric representation of its surroundings. Lifelong LERF maintains this representation over time by detecting semantic changes and selectively updating these regions of the environment, avoiding the need to exhaustively remap. Human users can query inventory by providing natural language queries and receiving a 3D heatmap of potential object locations. To manage the computational load, we use FogROS2, a cloud robotics platform, to offload resource-intensive tasks. Lifelong LERF obtains poses from a monocular RGB-D SLAM backend, and uses these poses to progressively optimize a Language Embedded Radiance Field (LERF) for semantic monitoring. Experiments with 3-5 objects arranged on a tabletop and a Turtlebot with a RealSense camera suggest that Lifelong LERF can persistently adapt to changes in objects with up to 91% accuracy.
https://arxiv.org/abs/2403.10494
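At query time, the language interface above amounts to scoring the map's distilled language embeddings against the embedding of the user's text. A minimal numpy sketch of that scoring, assuming per-point CLIP-style embeddings are already available (the array layout and plain cosine scoring are illustrative assumptions, not the authors' exact relevancy formulation):

```python
import numpy as np

def query_heatmap(point_embeds: np.ndarray, text_embed: np.ndarray) -> np.ndarray:
    """Score each 3D map point against a text query by cosine similarity.

    point_embeds: (N, D) language embeddings distilled into the map.
    text_embed:   (D,) embedding of the user's natural-language query.
    Returns an (N,) relevancy score in [-1, 1], usable as a 3D heatmap.
    """
    p = point_embeds / np.linalg.norm(point_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    return p @ t

# Usage: highlight the highest-scoring region for a query.
scores = query_heatmap(np.random.randn(1000, 512), np.random.randn(512))
top_points = np.argsort(scores)[-20:]  # candidate object locations
```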
Humans perceive and construct the world as an arrangement of simple parametric models. In particular, we can often describe man-made environments using volumetric primitives such as cuboids or cylinders. Inferring these primitives is important for attaining high-level, abstract scene descriptions. Previous approaches for primitive-based abstraction estimate shape parameters directly and are only able to reproduce simple objects. In contrast, we propose a robust estimator for primitive fitting, which meaningfully abstracts complex real-world environments using cuboids. A RANSAC estimator guided by a neural network fits these primitives to a depth map. We condition the network on previously detected parts of the scene, parsing it one-by-one. To obtain cuboids from single RGB images, we additionally optimise a depth estimation CNN end-to-end. Naively minimising point-to-primitive distances leads to large or spurious cuboids occluding parts of the scene. We thus propose an improved occlusion-aware distance metric correctly handling opaque scenes. Furthermore, we present a neural network based cuboid solver which provides more parsimonious scene abstractions while also reducing inference time. The proposed algorithm does not require labour-intensive labels, such as cuboid annotations, for training. Results on the NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
https://arxiv.org/abs/2403.10452
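The point-to-primitive term the abstract refers to can be made concrete with a standard oriented-box signed distance; the paper's contribution is an occlusion-aware refinement of exactly this kind of metric. A hedged numpy sketch of the naive baseline distance:

```python
import numpy as np

def cuboid_sdf(points: np.ndarray, center: np.ndarray,
               half_extents: np.ndarray, rot: np.ndarray) -> np.ndarray:
    """Signed distance from points (N, 3) to an oriented cuboid.

    rot: (3, 3) rotation whose columns are the cuboid axes. The naive
    point-to-primitive fitting cost the paper improves on is essentially
    the absolute value of this distance summed over inlier points.
    """
    local = (points - center) @ rot          # into the cuboid's frame
    q = np.abs(local) - half_extents         # per-axis overshoot
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=1)
    inside = np.minimum(np.max(q, axis=1), 0.0)
    return outside + inside
```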
Implicit neural representation methods have shown impressive advancements in learning 3D scenes from unstructured in-the-wild photo collections but are still limited by the large computational cost of volumetric rendering. More recently, 3D Gaussian Splatting emerged as a much faster alternative with superior rendering quality and training efficiency, especially for small-scale and object-centric scenarios. Nevertheless, this technique suffers from poor performance on unstructured in-the-wild data. To tackle this, we extend 3D Gaussian Splatting to handle unstructured image collections. We achieve this by modeling appearance to capture photometric variations in the rendered images. Additionally, we introduce a new mechanism to train transient Gaussians to handle the presence of scene occluders in an unsupervised manner. Experiments on diverse photo collection scenes and multi-pass acquisition of outdoor landmarks show the effectiveness of our method over prior works, achieving state-of-the-art results with improved efficiency.
https://arxiv.org/abs/2403.10427
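The appearance modeling mentioned above is commonly realized as a per-image latent code that modulates rendered colors, absorbing exposure and white-balance changes across the collection. A hedged PyTorch sketch under that assumption (the embedding size and MLP are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AppearanceModel(nn.Module):
    """Per-image appearance embedding modulating Gaussian colors."""

    def __init__(self, num_images: int, embed_dim: int = 16):
        super().__init__()
        self.embeds = nn.Embedding(num_images, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 3), nn.Sigmoid(),
        )

    def forward(self, base_rgb: torch.Tensor, image_idx: torch.Tensor):
        # base_rgb: (N, 3) per-Gaussian colors; image_idx: (N,) image ids.
        # The MLP maps (base color, image embedding) to the color actually
        # rendered for that image, capturing photometric variation.
        e = self.embeds(image_idx)
        return self.mlp(torch.cat([base_rgb, e], dim=-1))
```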
Manipulating deformable objects remains a challenge within robotics due to the difficulties of state estimation, long-horizon planning, and predicting how the object will deform given an interaction. These challenges are most pronounced with 3D deformable objects. We propose SculptDiff, a goal-conditioned diffusion-based imitation learning framework that works with point cloud state observations to directly learn clay sculpting policies for a variety of target shapes. To the best of our knowledge this is the first real-world method that successfully learns manipulation policies for 3D deformable objects. For sculpting videos and access to our dataset and hardware CAD models, see the project website: this https URL
https://arxiv.org/abs/2403.10401
Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation by leveraging Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from 2D diffusion models, which usually take the reference image as a condition while applying hard L2 image supervision at the reference view. Yet adhering too closely to the image tends to corrupt the inductive knowledge of the 2D diffusion model, frequently leading to flat or distorted 3D generation. In this work, we reexamine image-to-3D from a novel perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth angle by resting solely on the SDS loss. The core of our framework lies in a two-stage diffusion model fine-tuning. First, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, through which the model preliminarily acquires image-to-image capabilities. Second, we perform fine-tuning using our Explicit Multi-view Attention (EMA), which combines noisy multi-view images with the noise-free reference image as an explicit condition. The CLIP embedding is sent to the diffusion model throughout the whole process, while reference images are discarded once fine-tuning is complete. As a result, with a single image CLIP embedding, Isotropic3D is capable of generating multi-view mutually consistent images as well as a 3D model with more symmetrical and neat content, well-proportioned geometry, rich colored texture, and less distortion compared with existing image-to-3D methods, while still preserving the similarity to the reference image to a large extent. The project page is available at this https URL. The code and models are available at this https URL.
https://arxiv.org/abs/2403.10395
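Since Isotropic3D rests solely on the SDS loss, the per-view update is the standard Score Distillation Sampling gradient, conditioned here on an image CLIP embedding rather than text. A generic PyTorch sketch (the `unet` interface, timestep range, and weighting are assumptions, not the authors' exact schedule):

```python
import torch

def sds_grad(rendered: torch.Tensor, unet, clip_embed: torch.Tensor,
             alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """One Score Distillation Sampling step (generic sketch).

    `unet` is assumed to predict noise given a noisy image, a timestep,
    and the image-CLIP condition; `alphas_cumprod` is the diffusion
    noise schedule. Returns the gradient w.r.t. the rendered image.
    """
    t = torch.randint(20, 980, (1,), device=rendered.device)
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * rendered + (1 - a_t).sqrt() * noise
    with torch.no_grad():
        noise_pred = unet(noisy, t, clip_embed)
    w = 1 - a_t                      # a common weighting choice
    return w * (noise_pred - noise)

# Usage: rendered.backward(gradient=sds_grad(rendered, unet, emb, acp))
# pushes the gradient through the differentiable renderer into the 3D model.
```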
Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as the face, hands, or cloth wrinkles. They are also easily prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and capture spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation for points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud, or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera captures, and our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.
https://arxiv.org/abs/2403.10357
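Pixel-aligned features, one of ANIM's two feature streams, are typically gathered by projecting each 3D query point into the image and bilinearly sampling a CNN feature map. A short PyTorch sketch of that sampling step (the fusion with voxel-aligned features and the SDF decoder are omitted):

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(feat_map: torch.Tensor,
                           uv: torch.Tensor) -> torch.Tensor:
    """Sample image features at the 2D projections of 3D query points.

    feat_map: (1, C, H, W) CNN feature map of the RGB-D input.
    uv:       (1, N, 2) projected point coordinates in [-1, 1].
    Returns (1, N, C) features, later fused with voxel-aligned ones
    and decoded by an MLP into a signed distance value.
    """
    grid = uv.unsqueeze(2)                       # (1, N, 1, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=True)
    return sampled.squeeze(-1).transpose(1, 2)   # (1, N, C)
```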
The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird's Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors on single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once, for token selection or query initialization. In this paper, we present a single model termed SimPB, which simultaneously detects 2D objects in the perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: this https URL.
https://arxiv.org/abs/2403.10353
Surface parameterization is a fundamental geometry processing problem with rich downstream applications. Traditional approaches are designed to operate on well-behaved mesh models with high-quality triangulations that are laboriously produced by specialized 3D modelers, and are thus unable to meet the processing demand created by the current explosion of ordinary 3D data. In this paper, we seek to perform UV unwrapping on unstructured 3D point clouds. Technically, we propose ParaPoint, an unsupervised neural learning pipeline that achieves global free-boundary surface parameterization by building point-wise mappings between given 3D points and 2D UV coordinates with adaptively deformed boundaries. We ingeniously construct several geometrically meaningful sub-networks with specific functionalities and assemble them into a bi-directional cycle mapping framework. We also design effective loss functions and auxiliary differential geometric constraints for the optimization of the neural mapping process. To the best of our knowledge, this work makes the first attempt to investigate neural point cloud parameterization that pursues both global mappings and free boundaries. Experiments demonstrate the effectiveness and inspiring potential of our proposed learning paradigm. The code will be publicly available.
https://arxiv.org/abs/2403.10349
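The bi-directional cycle mapping framework suggests a round-trip consistency objective between the 3D-to-UV and UV-to-3D sub-networks. A minimal PyTorch sketch of one such term (ParaPoint's full objective also includes differential-geometric constraints not shown here):

```python
import torch

def cycle_loss(points: torch.Tensor, fwd, inv) -> torch.Tensor:
    """Cycle consistency for neural UV unwrapping.

    fwd: network mapping 3D points (N, 3) -> UV coords (N, 2).
    inv: network mapping UV coords (N, 2) -> 3D points (N, 3).
    Penalizes points that do not return to themselves after a
    3D -> UV -> 3D round trip.
    """
    uv = fwd(points)
    recon = inv(uv)
    return (recon - points).square().sum(dim=-1).mean()
```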
Neural implicit surface representation methods have recently shown impressive 3D reconstruction results. However, existing solutions struggle to reconstruct urban outdoor scenes due to their large, unbounded, and highly detailed nature. Hence, to achieve accurate reconstructions, additional supervision data such as LiDAR, strong geometric priors, and long training times are required. To tackle such issues, we present SCILLA, a new hybrid implicit surface learning method to reconstruct large driving scenes from 2D images. SCILLA's hybrid architecture models two separate implicit fields: one for the volumetric density and another for the signed distance to the surface. To accurately represent urban outdoor scenarios, we introduce a novel volume-rendering strategy that relies on self-supervised probabilistic density estimation to sample points near the surface and transition progressively from a volumetric to a surface representation. Unlike concurrent methods, our solution permits a proper and fast initialization of the signed distance field without relying on any geometric prior on the scene. By conducting extensive experiments on four outdoor driving datasets, we show that SCILLA can learn an accurate and detailed 3D surface scene representation in various urban scenarios while being two times faster to train compared to previous state-of-the-art solutions.
https://arxiv.org/abs/2403.10344
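One standard way to realize a progressive volumetric-to-surface transition is to derive density from the signed distance field through a Laplace CDF and anneal its sharpness, as in VolSDF-style formulations; whether SCILLA uses this exact mapping is an assumption. A PyTorch sketch:

```python
import torch

def sdf_to_density(sdf: torch.Tensor, beta: float) -> torch.Tensor:
    """Laplace-CDF density from signed distance (VolSDF-style sketch).

    Small beta concentrates density at the zero level set, so annealing
    beta downward during training realizes a progressive transition from
    a soft volumetric field to a sharp surface (schedule values would be
    chosen per scene; they are illustrative here).
    """
    alpha = 1.0 / beta
    return alpha * torch.where(
        sdf >= 0,                                # outside the surface
        0.5 * torch.exp(-sdf / beta),
        1.0 - 0.5 * torch.exp(sdf / beta),       # inside the surface
    )
```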
In recent years, Neural Radiance Fields (NeRFs) have demonstrated significant potential in encoding highly-detailed 3D geometry and environmental appearance, positioning themselves as a promising alternative to traditional explicit representations for 3D scene reconstruction. However, the predominant reliance on RGB imaging presupposes ideal lighting conditions: a premise frequently unmet in robotic applications plagued by poor lighting or visual obstructions. This limitation overlooks the capabilities of infrared (IR) cameras, which excel in low-light detection and present a robust alternative under such adverse scenarios. To tackle these issues, we introduce Thermal-NeRF, the first method that estimates a volumetric scene representation in the form of a NeRF solely from IR imaging. By leveraging a thermal mapping and a structural thermal constraint derived from the thermal characteristics of IR imaging, our method showcases unparalleled proficiency in recovering NeRFs in visually degraded scenes where RGB-based methods fall short. We conduct extensive experiments to demonstrate that Thermal-NeRF can achieve superior quality compared to existing methods. Furthermore, we contribute a dataset for IR-based NeRF applications, paving the way for future research in IR NeRF reconstruction.
https://arxiv.org/abs/2403.10340
Human avatars have become a novel type of 3D asset with various applications. Ideally, a human avatar should be fully customizable to accommodate different settings and environments. In this work, we introduce NECA, an approach capable of learning versatile human representations from monocular or sparse-view videos, enabling granular customization across aspects such as pose, shadow, shape, lighting, and texture. The core of our approach is to represent humans in complementary dual spaces and predict disentangled neural fields of geometry, albedo, and shadow, as well as external lighting, from which we are able to derive realistic renderings with high-frequency details via volumetric rendering. Extensive experiments demonstrate the advantage of our method over state-of-the-art methods in photorealistic rendering, as well as in various editing tasks such as novel pose synthesis and relighting. The code is available at this https URL.
https://arxiv.org/abs/2403.10335
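The disentangled fields can be recombined multiplicatively at render time, which is what makes per-factor editing possible. A deliberately simplified numpy stand-in for such a composition (NECA's actual shading model is richer):

```python
import numpy as np

def compose_color(albedo: np.ndarray, shading: np.ndarray,
                  shadow: np.ndarray) -> np.ndarray:
    """Recombine disentangled fields into a rendered color.

    albedo: (N, 3) from the texture field; shading: (N, 1) from the
    geometry and external lighting; shadow: (N, 1) in [0, 1]. Editing
    one factor (e.g. zeroing the shadow or swapping the lighting)
    leaves the others untouched, which is the point of disentanglement.
    """
    return np.clip(albedo * shading * shadow, 0.0, 1.0)
```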
Classical structure-based visual localization methods offer high accuracy but face trade-offs in terms of storage, speed, and privacy. A recent innovation, a keypoint scene coordinate regression (KSCR) method named D2S, addresses these issues by leveraging graph attention networks to enhance keypoint relationships and predict their 3D coordinates using a simple multilayer perceptron (MLP). The camera pose is then determined via PnP+RANSAC, using the established 2D-3D correspondences. While KSCR achieves competitive results, rivaling state-of-the-art image-retrieval methods like HLoc across multiple benchmarks, its performance is hindered when data samples are limited, owing to the deep learning model's reliance on extensive data. This paper addresses this challenge by introducing a pipeline for keypoint descriptor synthesis using Neural Radiance Fields (NeRF). By generating novel poses and feeding them into a trained NeRF model to create new views, our approach enhances KSCR's generalization capabilities in data-scarce environments. The proposed system can improve localization accuracy by up to 50% while requiring only a fraction of the time for data synthesis. Furthermore, its modular design allows for the integration of multiple NeRFs, offering a versatile and efficient solution for visual localization. The implementation is publicly available at: this https URL.
https://arxiv.org/abs/2403.10297
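The descriptor-synthesis pipeline reduces to a loop: perturb known poses, render the trained NeRF, and extract keypoint descriptors from the synthetic views. A hedged Python sketch where `nerf_render` and `extract_keypoints` are assumed interfaces, not the released API:

```python
import numpy as np

def synthesize_descriptors(nerf_render, extract_keypoints, base_poses,
                           jitter_scale=0.1, samples_per_pose=5):
    """Augment a sparse localization dataset with NeRF-rendered views.

    base_poses: list of 4x4 camera-to-world matrices from the original
    data. `nerf_render(pose)` returns an image from a trained NeRF;
    `extract_keypoints(img)` returns (keypoints, descriptors). Perturbed
    poses yield new views whose descriptors extend the training set.
    """
    synthetic = []
    for pose in base_poses:
        for _ in range(samples_per_pose):
            new_pose = pose.copy()
            new_pose[:3, 3] += np.random.randn(3) * jitter_scale
            img = nerf_render(new_pose)
            kps, descs = extract_keypoints(img)
            synthetic.append((new_pose, kps, descs))
    return synthetic
```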
The deepfake threats to society and cybersecurity have provoked significant public apprehension, driving intensified efforts within the realm of deepfake video detection. Current video-level methods are mostly based on 3D CNNs, which achieve good performance but result in high computational demands. This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to preserve spatial and temporal dependencies. The transformation process involves sequentially masking frames at the same positions within each frame. These frames are then resized into sub-frames and reorganized into the predetermined layout, forming thumbnails. TALL is model-agnostic and remarkably simple, necessitating only minimal code modifications. Furthermore, we introduce a graph reasoning block (GRB) and a semantic consistency (SC) loss to strengthen TALL, culminating in TALL++. The GRB enhances interactions between different semantic regions to capture semantic-level inconsistency clues. The semantic consistency loss imposes consistency constraints on semantic features to improve the model's generalization ability. Extensive experiments on intra-dataset and cross-dataset evaluation, diffusion-generated image detection, and deepfake generation method recognition show that TALL++ achieves results surpassing or comparable to state-of-the-art methods, demonstrating the effectiveness of our approaches for various deepfake detection problems. The code is available at this https URL.
https://arxiv.org/abs/2403.10261
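The thumbnail transformation itself is easy to pin down: frames of a clip are downsized and tiled into one image so a 2D backbone sees temporal structure spatially. A numpy sketch of the tiling (the per-frame masking step is omitted, and the grid size here is illustrative rather than TALL's exact setting):

```python
import numpy as np

def thumbnail_layout(clip: np.ndarray, grid: int = 2,
                     thumb: int = 112) -> np.ndarray:
    """Rearrange a video clip into a single thumbnail image.

    clip: (T, H, W, C) with T == grid * grid frames. Each frame is
    downsized (here: naive pixel striding, standing in for proper
    resizing) to thumb x thumb and placed into a grid x grid layout,
    preserving temporal order spatially.
    """
    T, H, W, C = clip.shape
    assert T == grid * grid
    ys = np.linspace(0, H - 1, thumb).astype(int)
    xs = np.linspace(0, W - 1, thumb).astype(int)
    out = np.zeros((grid * thumb, grid * thumb, C), clip.dtype)
    for t in range(T):
        sub = clip[t][np.ix_(ys, xs)]            # crude "resize"
        r, c = divmod(t, grid)
        out[r*thumb:(r+1)*thumb, c*thumb:(c+1)*thumb] = sub
    return out
```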
Reconstructing detailed 3D objects from single-view images remains a challenging task due to the limited information available. In this paper, we introduce FDGaussian, a novel two-stage framework for single-image 3D reconstruction. Recent methods typically utilize pre-trained 2D diffusion models to generate plausible novel views from the input image, yet they encounter issues with either multi-view inconsistency or a lack of geometric fidelity. To overcome these challenges, we propose an orthogonal plane decomposition mechanism to extract 3D geometric features from the 2D input, enabling the generation of consistent multi-view images. Moreover, we further accelerate state-of-the-art Gaussian Splatting by incorporating epipolar attention to fuse images from different viewpoints. We demonstrate that FDGaussian generates images with high consistency across different views and reconstructs high-quality 3D objects, both qualitatively and quantitatively. More examples can be found at our website: this https URL.
https://arxiv.org/abs/2403.10242
With the development of neural radiance fields and generative models, numerous methods have been proposed for learning 3D human generation from 2D images. These methods allow control over the pose of the generated 3D human and enable rendering from different viewpoints. However, none of these methods explore semantic disentanglement in human image synthesis, i.e., they cannot disentangle the generation of different semantic parts, such as the body, tops, and bottoms. Furthermore, existing methods are limited to synthesizing images at $512^2$ resolution due to the high computational cost of neural radiance fields. To address these limitations, we introduce SemanticHuman-HD, the first method to achieve semantically disentangled human image synthesis. Notably, SemanticHuman-HD is also the first method to achieve 3D-aware image synthesis at $1024^2$ resolution, benefiting from our proposed 3D-aware super-resolution module. By leveraging depth maps and semantic masks as guidance for the 3D-aware super-resolution, we significantly reduce the number of sampling points during volume rendering, thereby reducing the computational cost. Our comparative experiments demonstrate the superiority of our method. The effectiveness of each proposed component is also verified through ablation studies. Moreover, our method opens up exciting possibilities for various applications, including 3D garment generation, semantic-aware image synthesis, controllable image synthesis, and out-of-domain image synthesis.
https://arxiv.org/abs/2403.10166
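Using a depth map to guide the 3D-aware super-resolution lets sampling concentrate in a narrow band around the surface instead of spanning the whole ray, which is how the sample count drops. A hedged PyTorch sketch of that idea (band width and sample count are illustrative, not the paper's settings):

```python
import torch

def depth_guided_samples(depth: torch.Tensor, n_samples: int = 8,
                         band: float = 0.05) -> torch.Tensor:
    """Concentrate volume-rendering samples near a known depth.

    depth: (R,) per-ray depth from the guidance map. Instead of
    sampling the full near/far interval, draw n_samples depths inside
    a narrow band around the surface. Returns (R, n_samples) sample
    depths per ray.
    """
    offsets = (torch.rand(depth.shape[0], n_samples,
                          device=depth.device) - 0.5) * 2 * band
    return depth.unsqueeze(-1) + offsets
```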
This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, the complexity of processing high-resolution images, and lengthy optimization processes, thus facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in real-world scenarios. Specifically, we design a novel joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model. With the joint learning mechanism, the proposed framework can inherently estimate robust relative pose information from the image observations and thus largely alleviate the requirement for real camera poses. Moreover, we implement a deferred back-propagation mechanism that enables high-resolution training and inference, overcoming the resolution constraints of previous methods. To enhance speed and efficiency, we further introduce a progressive Gaussian cache module that dynamically adjusts during training and inference. As the first pose-free generalizable 3D-GS framework, GGRt achieves inference at $\ge$ 5 FPS and real-time rendering at $\ge$ 100 FPS. Through extensive experimentation, we demonstrate that our method outperforms existing NeRF-based pose-free techniques in terms of inference speed and effectiveness, and approaches the performance of 3D-GS methods that use real camera poses. Our contributions provide a significant leap forward for the integration of computer vision and computer graphics into practical applications, offering state-of-the-art results on the LLFF, KITTI, and Waymo Open datasets and enabling real-time rendering for immersive experiences.
https://arxiv.org/abs/2403.10147
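Deferred back-propagation is the ingredient that unlocks high-resolution training: render the full image without gradients, compute the loss gradient once, then re-render in chunks with gradients enabled and feed each chunk its cached gradient slice. A PyTorch sketch under an assumed `render_rows(r0, r1, grad)` interface (not GGRt's actual API):

```python
import torch

def deferred_backprop(render_rows, image_loss, H: int, chunk: int = 64):
    """Deferred back-propagation for high-resolution training (sketch).

    1) Render the full image without gradients and compute dL/dI.
    2) Re-render row chunks with gradients and backprop the cached
       gradient slice, so activations for only one chunk live in
       memory at a time.
    """
    with torch.no_grad():
        full = render_rows(0, H, grad=False)
    full = full.requires_grad_(True)
    loss = image_loss(full)
    grad_img = torch.autograd.grad(loss, full)[0]
    for r0 in range(0, H, chunk):
        r1 = min(r0 + chunk, H)
        rows = render_rows(r0, r1, grad=True)
        rows.backward(gradient=grad_img[r0:r1])
```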
Tactile sensing represents a crucial technique that can enhance the performance of robotic manipulators in various tasks. This work presents a novel bioinspired neuromorphic vision-based tactile sensor that uses an event-based camera to quickly capture and convey information about the interactions between robotic manipulators and their environment. The camera in the sensor observes the deformation of a flexible skin manufactured from a cheap and accessible 3D-printed material, while a 3D-printed rigid casing houses the components of the sensor. The sensor is tested in a grasping-stage classification task involving several objects, using a data-driven learning-based approach. The results show that the proposed approach enables the sensor to detect pressing and slip incidents within 2 ms. The fast tactile perception properties of the proposed sensor make it an ideal candidate for safe grasping of different objects in industries that involve high-speed pick-and-place operations.
https://arxiv.org/abs/2403.10120
We propose a novel rolling shutter bundle adjustment method for neural radiance fields (NeRF), which utilizes unordered rolling shutter (RS) images to obtain an implicit 3D representation. Existing NeRF methods suffer from low-quality images and inaccurate initial camera poses due to the RS effect in the image, whereas the previous method that incorporates RS into NeRF requires strictly sequential data input, limiting its widespread applicability. In contrast, our method recovers the physical formation of RS images by estimating camera poses and velocities, thereby removing the input constraint of sequential data. Moreover, we adopt a coarse-to-fine training strategy, in which the RS epipolar constraints of the pairwise frames in the scene graph are used to detect camera poses that fall into local minima. The poses detected as outliers are corrected by interpolation from neighboring poses. The experimental results validate the effectiveness of our method over state-of-the-art works and demonstrate that the reconstruction of 3D representations is not constrained by the requirement of video sequence input.
https://arxiv.org/abs/2403.10119
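Recovering the physical formation of an RS image hinges on giving each scanline its own pose, derived from a start-of-frame pose and the estimated velocities. A numpy sketch of a constant-velocity row-pose model (the small-angle rotation update is for illustration only, not the paper's exact parameterization):

```python
import numpy as np

def row_pose(pose: np.ndarray, velocity: np.ndarray,
             row: int, num_rows: int, readout_time: float) -> np.ndarray:
    """Pose of a rolling-shutter camera at a given image row.

    pose:     4x4 camera pose at the first row.
    velocity: 6-vector of (linear, angular) rates.
    Each row is exposed readout_time / num_rows later than the previous
    one, so its pose is the start pose advanced by t * velocity.
    """
    t = readout_time * row / num_rows
    v, w = velocity[:3], velocity[3:]
    out = pose.copy()
    out[:3, 3] += t * v
    wx, wy, wz = t * w
    dR = np.array([[1, -wz, wy], [wz, 1, -wx], [-wy, wx, 1]])  # small-angle
    out[:3, :3] = dR @ pose[:3, :3]
    return out
```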
Visual-language models (VLMs) have recently been introduced in robotic mapping by using the latent representations, i.e., embeddings, of the VLMs to represent the natural language semantics in the map. The main benefit is moving beyond a small set of human-created labels toward open-vocabulary scene understanding. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, rigorous analysis of the quality of the maps using these embeddings is lacking. We investigate two critical properties of map quality: queryability and consistency. The evaluation of queryability addresses the ability to retrieve information from the embeddings. We investigate two aspects of consistency: intra-map consistency and inter-map consistency. Intra-map consistency captures the ability of the embeddings to represent abstract semantic classes, and inter-map consistency captures the generalization properties of the representation. In this paper, we propose a way to analyze the quality of maps created using VLMs, which forms an open-source benchmark to be used when proposing new open-vocabulary map representations. We demonstrate the benchmark by evaluating the maps created by two state-of-the-art methods, VLMaps and OpenScene, using two encoders, LSeg and OpenSeg, using real-world data from the Matterport3D data set. We find that OpenScene outperforms VLMaps with both encoders, and LSeg outperforms OpenSeg with both methods.
https://arxiv.org/abs/2403.10117
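Queryability can be made operational as retrieval accuracy: embed candidate class names, match each map location's embedding to the nearest label, and score against ground truth. A hedged numpy sketch of such a proxy metric (the paper's exact protocol may differ in detail):

```python
import numpy as np

def queryability(map_embeds: np.ndarray, label_embeds: np.ndarray,
                 true_labels: np.ndarray) -> float:
    """Top-1 queryability of an open-vocabulary map (sketch).

    map_embeds:   (N, D) embeddings stored at map locations.
    label_embeds: (L, D) text embeddings of candidate class names.
    true_labels:  (N,) ground-truth class index per location.
    Assigns each location the class with highest cosine similarity
    and reports accuracy.
    """
    m = map_embeds / np.linalg.norm(map_embeds, axis=1, keepdims=True)
    lab = label_embeds / np.linalg.norm(label_embeds, axis=1, keepdims=True)
    pred = (m @ lab.T).argmax(axis=1)
    return float((pred == true_labels).mean())
```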
In this paper, we present KP-RED, a unified KeyPoint-driven REtrieval and Deformation framework that takes object scans as input and jointly retrieves and deforms the most geometrically similar CAD models from a pre-processed database to tightly match the target. Unlike existing dense-matching-based methods that typically struggle with noisy partial scans, we propose to leverage category-consistent sparse keypoints to naturally handle both full and partial object scans. Specifically, we first employ a lightweight retrieval module to establish a keypoint-based embedding space, measuring the similarity among objects by dynamically aggregating deformation-aware local-global features around extracted keypoints. Objects that are close in the embedding space are considered similar in geometry. Then we introduce a neural cage-based deformation module that estimates the influence vector of each keypoint upon the cage vertices inside its local support region to control the deformation of the retrieved shape. Extensive experiments on the synthetic dataset PartNet and the real-world dataset Scan2CAD demonstrate that KP-RED surpasses existing state-of-the-art approaches by a large margin. Code and trained models will be released at this https URL.
https://arxiv.org/abs/2403.10099
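Retrieval in KP-RED comes down to nearest-neighbor search in the learned keypoint-embedding space; the deformation module then takes over on the retrieved shape. A minimal numpy sketch of the retrieval step (array names are illustrative):

```python
import numpy as np

def retrieve_cad(scan_embed: np.ndarray,
                 db_embeds: np.ndarray, k: int = 5) -> np.ndarray:
    """Retrieve the k most similar CAD models in embedding space.

    scan_embed: (D,) embedding of the input scan.
    db_embeds:  (M, D) pre-computed database embeddings.
    Proximity in this learned space is the paper's notion of geometric
    similarity; the cage-based deformation module then warps the
    retrieved shape to tightly match the scan.
    """
    d = np.linalg.norm(db_embeds - scan_embed, axis=1)
    return np.argsort(d)[:k]
```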