Neural radiance fields (NeRF) have revolutionized the field of image-based view synthesis. However, NeRF uses straight rays and fails to deal with complicated light path changes caused by refraction and reflection. This prevents NeRF from successfully synthesizing transparent or specular objects, which are ubiquitous in real-world robotics and AR/VR applications. In this paper, we introduce the refractive-reflective field. Taking the object silhouette as input, we first utilize marching tetrahedra with a progressive encoding to reconstruct the geometry of non-Lambertian objects and then model the refraction and reflection effects of the object in a unified framework using Fresnel terms. Meanwhile, to achieve efficient and effective anti-aliasing, we propose a virtual cone supersampling technique. We benchmark our method on different shapes, backgrounds and Fresnel terms on both real-world and synthetic datasets. We also qualitatively and quantitatively benchmark the rendering results of various editing applications, including material editing, object replacement/insertion, and environment illumination estimation. Codes and data are publicly available at this https URL.
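For readers unfamiliar with the Fresnel terms mentioned above, the sketch below shows the standard Snell/Fresnel computations for a dielectric interface in plain NumPy; the function names, example indices of refraction, and test ray are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of Snell refraction and unpolarized Fresnel reflectance,
# assuming a simple dielectric interface (not the paper's actual code).
import numpy as np

def reflect(d, n):
    """Mirror an incident unit direction d about unit normal n."""
    return d - 2.0 * np.dot(d, n) * n

def refract_and_fresnel(d, n, n1=1.0, n2=1.5):
    """Return (refracted_dir, reflectance R) for a dielectric interface.

    d: incident unit direction (pointing toward the surface),
    n: unit normal pointing against d (dot(d, n) < 0),
    n1, n2: indices of refraction on the incident / transmitted side.
    """
    cos_i = -np.dot(d, n)
    eta = n1 / n2
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:                              # total internal reflection
        return None, 1.0
    cos_t = np.sqrt(1.0 - sin2_t)
    t = eta * d + (eta * cos_i - cos_t) * n       # Snell's law in vector form
    r_s = (n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)
    r_p = (n2 * cos_i - n1 * cos_t) / (n2 * cos_i + n1 * cos_t)
    R = 0.5 * (r_s ** 2 + r_p ** 2)               # unpolarized Fresnel reflectance
    return t / np.linalg.norm(t), R

d = np.array([0.0, -np.sqrt(0.5), -np.sqrt(0.5)])  # ray hitting glass at 45 degrees
n = np.array([0.0, 0.0, 1.0])
t, R = refract_and_fresnel(d, n)
print("refracted:", t, "reflectance:", R)
```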
https://arxiv.org/abs/2309.13039
Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which, as a judgement for model privacy leakage, are more trustworthy. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods.
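The triplet setup described above can be sketched as follows in PyTorch; the encoder architecture, tensor shapes, and margin are illustrative assumptions rather than the paper's actual SemSim network.

```python
# A minimal PyTorch sketch of the SemSim-style triplet training, assuming a
# toy CNN encoder and random tensors in place of real annotated data.
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # stand-in feature extractor
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128),
)
triplet = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# anchor: original image; positive: a reconstruction annotators recognized;
# negative: a reconstruction annotators could not recognize.
anchor   = torch.randn(8, 3, 64, 64)
positive = torch.randn(8, 3, 64, 64)
negative = torch.randn(8, 3, 64, 64)

loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
opt.step()

# At test time, privacy leakage would be scored by the embedding distance
# between an original image and its reconstruction (smaller = more leakage).
```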
https://arxiv.org/abs/2309.13038
The reconstruction kernel in computed tomography (CT) generation determines the texture of the image. Consistency in reconstruction kernels is important as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements due to inconsistent reconstruction kernels. Existing methods investigate harmonization of CT scans in single or multiple manufacturers. However, these methods require paired scans of hard and soft reconstruction kernels that are spatially and anatomically aligned. Additionally, a large number of models need to be trained across different kernel pairs within manufacturers. In this study, we adopt an unpaired image translation approach to investigate harmonization between and across reconstruction kernels from different manufacturers by constructing a multipath cycle generative adversarial network (GAN). We use hard and soft reconstruction kernels from the Siemens and GE vendors from the National Lung Screening Trial dataset. We use 50 scans from each reconstruction kernel and train a multipath cycle GAN. To evaluate the effect of harmonization on the reconstruction kernels, we harmonize 50 scans each from Siemens hard kernel, GE soft kernel and GE hard kernel to a reference Siemens soft kernel (B30f) and evaluate percent emphysema. We fit a linear model by considering the age, smoking status, sex and vendor and perform an analysis of variance (ANOVA) on the emphysema scores. Our approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status and vendor on emphysema quantification.
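The emphysema analysis step (a linear model over age, smoking status, sex and vendor, followed by ANOVA) could be expressed roughly as below with statsmodels; the column names and the tiny synthetic data frame are assumptions for illustration only.

```python
# A sketch of the linear-model + ANOVA analysis on emphysema scores, assuming
# hypothetical column names and synthetic values (not the NLST data).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "emphysema_pct": [3.1, 5.4, 2.8, 6.0, 4.2, 3.9, 5.1, 2.5],
    "age":           [61, 67, 59, 72, 65, 63, 70, 58],
    "sex":           ["M", "F", "M", "F", "M", "F", "F", "M"],
    "smoking":       ["current", "former", "former", "current",
                      "current", "former", "current", "former"],
    "vendor":        ["Siemens", "GE", "Siemens", "GE",
                      "GE", "Siemens", "Siemens", "GE"],
})

# Linear model of percent emphysema on age, smoking status, sex and vendor,
# followed by a type-II analysis of variance.
model = smf.ols("emphysema_pct ~ age + C(smoking) + C(sex) + C(vendor)",
                data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```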
https://arxiv.org/abs/2309.12953
With the development of the neural field, reconstructing the 3D model of a target object from multi-view inputs has recently attracted increasing attention from the community. Existing methods normally learn a neural field for the whole scene, while it is still under-explored how to reconstruct a certain object indicated by users on-the-fly. Considering that the Segment Anything Model (SAM) has shown effectiveness in segmenting any 2D image, in this paper we propose Neural Object Cloning (NOC), a novel high-quality 3D object reconstruction method, which leverages the benefits of both the neural field and SAM from two aspects. Firstly, to separate the target object from the scene, we propose a novel strategy to lift the multi-view 2D segmentation masks of SAM into a unified 3D variation field. The 3D variation field is then projected into 2D space and generates new prompts for SAM. This process iterates until convergence, separating the target object from the scene. Then, apart from 2D masks, we further lift the 2D features of the SAM encoder into a 3D SAM field in order to improve the reconstruction quality of the target object. NOC lifts the 2D masks and features of SAM into the 3D neural field for high-quality target object reconstruction. We conduct detailed experiments on several benchmark datasets to demonstrate the advantages of our method. The code will be released.
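As a rough, hedged illustration of lifting multi-view 2D masks into a unified 3D field, the toy sketch below projects voxel centers into each view and averages mask votes; the camera model, grid resolution, and voting rule are assumptions, and NOC's actual variation field and iterative SAM prompting are more involved.

```python
# A toy mask-lifting sketch: vote each voxel by projecting it into every view
# and averaging the 2D mask values it lands on (assumed pinhole cameras).
import numpy as np

def lift_masks_to_volume(masks, intrinsics, extrinsics, grid_res=32, extent=1.0):
    """masks: list of HxW arrays in [0,1]; extrinsics: world-to-camera 4x4."""
    xs = np.linspace(-extent, extent, grid_res)
    X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
    pts = np.stack([X, Y, Z, np.ones_like(X)], -1).reshape(-1, 4)  # Nx4 homogeneous
    votes = np.zeros(len(pts))
    for mask, K, T in zip(masks, intrinsics, extrinsics):
        cam = (T @ pts.T).T                              # to camera coordinates
        in_front = cam[:, 2] > 1e-6
        uv = (K @ cam[:, :3].T).T
        uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        h, w = mask.shape
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        votes[valid] += mask[v[valid], u[valid]]
    return (votes / len(masks)).reshape(grid_res, grid_res, grid_res)

K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
T = np.eye(4); T[2, 3] = 3.0                             # camera in front of the grid
mask = np.zeros((64, 64)); mask[20:44, 20:44] = 1.0
field = lift_masks_to_volume([mask], [K], [T])
print(field.shape, field.max())
```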
https://arxiv.org/abs/2309.12790
Eyebrows play a critical role in facial expression and appearance. Although the 3D digitization of faces is well explored, less attention has been drawn to 3D eyebrow modeling. In this work, we propose EMS, the first learning-based framework for single-view 3D eyebrow reconstruction. Following the methods of scalp hair reconstruction, we also represent the eyebrow as a set of fiber curves and convert the reconstruction into a fiber-growing problem. Three modules are then carefully designed: RootFinder first localizes the fiber root positions, which indicate where to grow; OriPredictor predicts an orientation field in the 3D space to guide the growing of fibers; FiberEnder is designed to determine when to stop the growth of each fiber. Our OriPredictor directly borrows the method used in hair reconstruction. Considering the differences between hair and eyebrows, both RootFinder and FiberEnder are newly proposed. Specifically, to cope with the challenge that the root location is severely occluded, we formulate root localization as a density map estimation task. Given the predicted density map, a density-based clustering method is further used for finding the roots. For each fiber, the growth starts from the root point and moves step by step until it ends, where each step is defined as an oriented line with a constant length according to the predicted orientation field. To determine when to end, a pixel-aligned RNN architecture is designed to form a binary classifier that outputs a stop/continue decision at each growing step. To support the training of all proposed networks, we build the first 3D synthetic eyebrow dataset that contains 400 high-quality eyebrow models manually created by artists. Extensive experiments have demonstrated the effectiveness of the proposed EMS pipeline on a variety of eyebrow styles and lengths, ranging from short and sparse to long bushy eyebrows.
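The root-to-tip growth procedure can be pictured with the schematic below: constant-length steps along a predicted orientation field until a stop predicate fires. The `orientation_field` and `should_stop` callables are stand-ins for OriPredictor and the pixel-aligned RNN FiberEnder, not the paper's networks.

```python
# A hedged sketch of the fiber growth loop described above, with toy stand-ins
# for the orientation field and the stop classifier.
import numpy as np

def grow_fiber(root, orientation_field, should_stop, step_len=0.5, max_steps=64):
    pts = [np.asarray(root, dtype=float)]
    for _ in range(max_steps):
        d = orientation_field(pts[-1])
        d = d / (np.linalg.norm(d) + 1e-8)       # unit growth direction
        pts.append(pts[-1] + step_len * d)       # one oriented, constant-length step
        if should_stop(pts):                     # pixel-aligned RNN in the paper
            break
    return np.stack(pts)

# Toy field curving in the xy-plane; stop after a fixed number of steps.
fiber = grow_fiber(
    root=[0.0, 0.0, 0.0],
    orientation_field=lambda p: np.array([1.0, 0.2 * p[0], 0.0]),
    should_stop=lambda pts: len(pts) > 20,
)
print(fiber.shape)
```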
https://arxiv.org/abs/2309.12787
Thanks to the introduction of the tokenization procedure and the vision transformer backbone, image data now enjoys a simple-but-effective self-supervised learning scheme built upon masking and a self-reconstruction objective. Convolutional neural networks, another important and widely adopted architecture for image data, are instead driven by contrastive-learning techniques and still face difficulty in leveraging such a straightforward and general masking operation to significantly benefit their learning process. In this work, we aim to alleviate the burden of including the masking operation in the contrastive-learning framework for convolutional neural networks as an extra augmentation method. In addition to the additive but unwanted edges (between masked and unmasked regions) and other adverse effects of masking operations for ConvNets, which have been discussed by prior works, we particularly identify the potential problem where, for one view in a contrastive sample pair, the randomly sampled masking regions could be overly concentrated on important/salient objects, resulting in misleading contrastiveness with respect to the other view. To this end, we propose to explicitly take a saliency constraint into consideration, so that the masked regions are more evenly distributed between the foreground and background when realizing the masking-based augmentation. Moreover, we introduce hard negative samples by masking larger regions of salient patches in an input image. Extensive experiments conducted on various datasets, contrastive learning mechanisms, and downstream tasks verify the efficacy and superior performance of our proposed method with respect to several state-of-the-art baselines.
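One way to picture the saliency constraint is the hedged sketch below, which samples half of the masked patches from salient (foreground) patches and half from the rest instead of masking uniformly at random; the patch size, mask ratio, and median-split threshold are assumptions, not the paper's exact procedure.

```python
# A toy saliency-balanced patch masking sketch: split patches into salient and
# non-salient halves and draw masked patches evenly from both.
import numpy as np

def saliency_balanced_mask(saliency, patch=16, mask_ratio=0.5, rng=None):
    rng = rng or np.random.default_rng()
    h, w = saliency.shape
    gh, gw = h // patch, w // patch
    # mean saliency per patch, then a median split into foreground / background
    patch_sal = saliency[:gh * patch, :gw * patch] \
        .reshape(gh, patch, gw, patch).mean((1, 3))
    order = np.argsort(patch_sal.ravel())
    bg, fg = order[: order.size // 2], order[order.size // 2:]
    n_mask = int(mask_ratio * gh * gw)
    chosen = np.concatenate([
        rng.choice(fg, n_mask // 2, replace=False),       # salient half
        rng.choice(bg, n_mask - n_mask // 2, replace=False),  # background half
    ])
    mask = np.zeros(gh * gw, dtype=bool)
    mask[chosen] = True
    return mask.reshape(gh, gw)          # True = patch is masked out

sal = np.random.rand(224, 224)
print(saliency_balanced_mask(sal).mean())   # ~0.5 of patches masked, split evenly
```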
https://arxiv.org/abs/2309.12757
Accurate calibration is crucial for using multiple cameras to triangulate the position of objects precisely. However, it is also a time-consuming process that needs to be repeated for every displacement of the cameras. The standard approach is to use a printed pattern with known geometry to estimate the intrinsic and extrinsic parameters of the cameras. The same idea can be applied to event-based cameras, though it requires extra work. By using frame reconstruction from events, a printed pattern can be detected. A blinking pattern can also be displayed on a screen. Then, the pattern can be directly detected from the events. Such calibration methods can provide accurate intrinsic calibration for both frame- and event-based cameras. However, using 2D patterns has several limitations for multi-camera extrinsic calibration, with cameras possessing highly different points of view and a wide baseline. The 2D pattern can only be detected from one direction and needs to be of significant size to compensate for its distance to the camera. This makes the extrinsic calibration time-consuming and cumbersome. To overcome these limitations, we propose eWand, a new method that uses blinking LEDs inside opaque spheres instead of a printed or displayed pattern. Our method provides a faster, easier-to-use extrinsic calibration approach that maintains high accuracy for both event- and frame-based cameras.
https://arxiv.org/abs/2309.12685
Accurate 3D shape abstraction from a single 2D image is a long-standing problem in computer vision and graphics. By leveraging a set of primitives to represent the target shape, recent methods have achieved promising results. However, these methods either use a relatively large number of primitives or lack geometric flexibility due to the limited expressibility of the primitives. In this paper, we propose a novel bi-channel Transformer architecture, integrated with parameterized deformable models, termed DeFormer, to simultaneously estimate the global and local deformations of primitives. In this way, DeFormer can abstract complex object shapes while using a small number of primitives which offer a broader geometry coverage and finer details. Then, we introduce a force-driven dynamic fitting and a cycle-consistent re-projection loss to optimize the primitive parameters. Extensive experiments on ShapeNet across various settings show that DeFormer achieves better reconstruction accuracy over the state-of-the-art, and visualizes with consistent semantic correspondences for improved interpretability.
https://arxiv.org/abs/2309.12594
Existing time-resolved non-line-of-sight (NLOS) imaging methods reconstruct hidden scenes by inverting the optical paths of indirect illumination measured at visible relay surfaces. These methods are prone to reconstruction artifacts due to inversion ambiguities and capture noise, which are typically mitigated through the manual selection of filtering functions and parameters. We introduce a fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates the imaging parameters during the reconstruction of hidden scenes, using as input only the measured illumination while working both in the time and frequency domains. Our pipeline extracts a geometric representation of the hidden scene from NLOS volumetric intensities and estimates the time-resolved illumination at the relay wall produced by such geometric information using differentiable transient rendering. We then use gradient descent to optimize imaging parameters by minimizing the error between our simulated time-resolved illumination and the measured illumination. Our end-to-end differentiable pipeline couples diffraction-based volumetric NLOS reconstruction with path-space light transport and a simple ray marching technique to extract detailed, dense sets of surface points and normals of hidden scenes. We demonstrate the robustness of our method to consistently reconstruct geometry and albedo, even under significant noise levels.
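The self-calibration idea, optimizing imaging parameters by gradient descent so that a differentiable simulation matches the measurement, can be sketched as below; `render_transient` is a toy stand-in for the paper's differentiable transient renderer, and the parameters are purely illustrative.

```python
# A schematic of gradient-descent self-calibration, assuming a toy differentiable
# "transient renderer" (a Gaussian pulse) instead of the paper's light-transport model.
import torch

def render_transient(params, t):
    # delay/width/gain play the role of calibratable imaging parameters
    delay, width, gain = params
    return gain * torch.exp(-0.5 * ((t - delay) / width) ** 2)

t = torch.linspace(0, 10, 512)
measured = render_transient(torch.tensor([4.0, 0.7, 2.0]), t)   # pretend capture

params = torch.tensor([3.0, 1.0, 1.0], requires_grad=True)      # initial guess
opt = torch.optim.Adam([params], lr=5e-2)
for step in range(300):
    opt.zero_grad()
    loss = torch.mean((render_transient(params, t) - measured) ** 2)
    loss.backward()
    opt.step()
print(params.detach())   # should approach [4.0, 0.7, 2.0]
```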
https://arxiv.org/abs/2309.12047
Reconstructing a surface from a point cloud is an underdetermined problem. We use a neural network to study and quantify this reconstruction uncertainty under a Poisson smoothness prior. Our algorithm addresses the main limitations of existing work and can be fully integrated into the 3D scanning pipeline, from obtaining an initial reconstruction to deciding on the next best sensor position and updating the reconstruction upon capturing more data.
https://arxiv.org/abs/2309.11993
We present NeuralLabeling, a labeling approach and toolset for annotating a scene using either bounding boxes or meshes and generating segmentation masks, affordance maps, 2D bounding boxes, 3D bounding boxes, 6DOF object poses, depth maps and object meshes. NeuralLabeling uses Neural Radiance Fields (NeRF) as its renderer, allowing labeling to be performed with 3D spatial tools while incorporating geometric clues such as occlusions, relying only on images captured from multiple viewpoints as input. To demonstrate the applicability of NeuralLabeling to a practical problem in robotics, we added ground-truth depth maps to 30,000 frames of RGB images and noisy depth maps of transparent glasses placed in a dishwasher, captured using an RGBD sensor, yielding the Dishwasher30k dataset. We show that training a simple deep neural network with supervision using the annotated depth maps yields higher reconstruction performance than training with the previously applied weakly supervised approach.
https://arxiv.org/abs/2309.11966
We present Ego3DPose, a highly accurate binocular egocentric 3D pose reconstruction system. The binocular egocentric setup is practical and useful in various applications; however, it remains largely under-explored. It suffers from low pose-estimation accuracy due to viewing distortion, severe self-occlusion, and the limited field of view of the joints in egocentric 2D images. Here, we notice that two important 3D cues contained in the egocentric binocular input, stereo correspondence and perspective, are neglected. Current methods heavily rely on 2D image features, implicitly learning 3D information, which introduces biases towards commonly observed motions and leads to low overall accuracy. We observe that they fail not only in challenging occlusion cases but also in estimating visible joint positions. To address these challenges, we propose two novel approaches. First, we design a two-path network architecture with a path that estimates the pose of each limb independently from its binocular heatmaps. Without full-body information provided, this path alleviates the bias toward the trained full-body distribution. Second, we leverage the egocentric view of body limbs, which exhibits strong perspective variance (e.g., a significantly larger hand when it is close to the camera). We propose a new perspective-aware representation using trigonometry, enabling the network to estimate the 3D orientation of limbs. Finally, we develop an end-to-end pose reconstruction network that synergizes both techniques. Our comprehensive evaluations demonstrate that Ego3DPose outperforms state-of-the-art models with a pose estimation error (MPJPE) reduction of 23.1% on the UnrealEgo dataset. Our qualitative results highlight the superiority of our approach across a range of scenarios and challenges.
https://arxiv.org/abs/2309.11962
Recently, linear computed tomography (LCT) systems have attracted increasing attention. To weaken projection truncation and image the region of interest (ROI) for LCT, the backprojection filtration (BPF) algorithm is an effective solution. However, in BPF for LCT it is difficult to achieve stable interior reconstruction, and for differentiated backprojection (DBP) images of LCT, repeated rotation, finite inversion of the Hilbert transform (Hilbert filtering), and inverse-rotation operations blur the image. To support multiple reconstruction scenarios for LCT, including the interior ROI, the complete object, and the exterior region beyond the field of view (FOV), and to avoid the rotation operations of Hilbert filtering, we propose two types of reconstruction architectures. The first overlays multiple DBP images to obtain a complete DBP image, then uses a network to learn the overlying Hilbert filtering function, referred to as the Overlay-Single Network (OSNet). The second uses multiple networks to train different directional Hilbert filtering models for the DBP images of multiple linear scannings, respectively, and then overlays the reconstructed results, i.e., Multiple Networks Overlaying (MNetO). In both architectures, we introduce a Swin Transformer (ST) block into the generator of pix2pixGAN to extract local and global features from DBP images at the same time. We evaluate the two architectures across different networks, FOV sizes, pixel sizes, numbers of projections, geometric magnifications, and processing times. Experimental results show that both architectures can recover images. OSNet outperforms BPF in various scenarios. Across the different networks, ST-pix2pixGAN is superior to pix2pixGAN and CycleGAN. MNetO exhibits a few artifacts due to the differences among its multiple models, but any one of its models is suitable for imaging the exterior edge in a certain direction.
https://arxiv.org/abs/2309.11858
We study neural image compression based on the Sparse Visual Representation (SVR), where images are embedded into a discrete latent space spanned by learned visual codebooks. By sharing codebooks with the decoder, the encoder transfers integer codeword indices that are efficient and cross-platform robust, and the decoder retrieves the embedded latent feature using the indices for reconstruction. Previous SVR-based compression lacks an effective mechanism for rate-distortion tradeoffs, where one can only pursue either high reconstruction quality or low transmission bitrate. We propose a Masked Adaptive Codebook learning (M-AdaCode) method that applies masks to the latent feature subspace to balance bitrate and reconstruction quality. A set of semantic-class-dependent basis codebooks is learned, and these are combined with adaptive weights to generate a rich latent feature for high-quality reconstruction. The combining weights are adaptively derived from each input image, providing fidelity information at an additional transmission cost. By masking out unimportant weights in the encoder and recovering them in the decoder, we can trade off reconstruction quality for transmission bits, and the masking rate controls the balance between bitrate and distortion. Experiments on the standard JPEG-AI dataset demonstrate the effectiveness of our M-AdaCode approach.
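The weight-masking mechanism can be illustrated with the toy sketch below: per-image weights over basis codebook features are partially zeroed before "transmission" and recovered with a simple fallback in the decoder. The shapes, the drop-smallest rule, and the uniform fallback are assumptions made purely for illustration, not M-AdaCode's actual design.

```python
# A toy rate-distortion illustration of masked adaptive codebook weighting,
# assuming random basis features and a uniform decoder-side fallback.
import numpy as np

rng = np.random.default_rng(0)
n_codebooks, feat_dim = 8, 16
basis_feats = rng.normal(size=(n_codebooks, feat_dim))   # per-class basis features
weights = rng.random(n_codebooks)
weights /= weights.sum()                                  # adaptive combining weights

def transmit(weights, mask_rate):
    keep = np.argsort(weights)[int(mask_rate * len(weights)):]   # drop smallest
    masked = np.zeros_like(weights)
    masked[keep] = weights[keep]                          # only these cost bits
    return masked

def decode(masked):
    recovered = masked.copy()
    dropped = masked == 0
    if dropped.any():                                     # decoder-side recovery
        recovered[dropped] = (1.0 - masked.sum()) / dropped.sum()
    return recovered @ basis_feats                        # fused latent feature

full = weights @ basis_feats
for rate in (0.0, 0.25, 0.5, 0.75):
    err = np.linalg.norm(decode(transmit(weights, rate)) - full)
    print(f"mask_rate={rate:.2f}  reconstruction error={err:.3f}")
```

Higher mask rates save bits on the combining weights at the cost of a larger feature error, which is the tradeoff the masking rate controls.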
https://arxiv.org/abs/2309.11661
Creating accurate 3D models of tree topology is an important task for tree pruning. The 3D model is used to decide which branches to prune and then to execute the pruning cuts. Previous methods for creating 3D tree models have typically relied on point clouds, which are often computationally expensive to process and can suffer from data defects, especially with thin branches. In this paper, we propose a method for actively scanning along a primary tree branch, detecting secondary branches to be pruned, and reconstructing their 3D geometry using just an RGB camera mounted on a robot arm. We experimentally validate that our setup is able to produce primary branch models with 4-5 mm accuracy and secondary branch models with 15 degrees orientation accuracy with respect to the ground truth model. Our framework is real-time and can run up to 10 cm/s with no loss in model accuracy or ability to detect secondary branches.
https://arxiv.org/abs/2309.11580
3D face reconstruction algorithms from images and videos are applied to many fields, from plastic surgery to the entertainment sector, thanks to their advantageous features. However, when looking at forensic applications, 3D face reconstruction must observe strict requirements that still make its possible role in bringing evidence to a lawsuit unclear. An extensive investigation of the constraints, potential, and limits of its application in forensics is still missing. Shedding some light on this matter is the goal of the present survey, which starts by clarifying the relation between forensic applications and biometrics, with a focus on face recognition. Therefore, it provides an analysis of the achievements of 3D face reconstruction algorithms from surveillance videos and mugshot images and discusses the current obstacles that separate 3D face reconstruction from an active role in forensic applications. Finally, it examines the underlying data sets, with their advantages and limitations, while proposing alternatives that could substitute or complement them.
https://arxiv.org/abs/2309.11357
Neural radiance field is an emerging rendering method that generates high-quality multi-view consistent images from a neural scene representation and volume rendering. Although neural radiance field-based techniques are robust for scene reconstruction, their ability to add or remove objects remains limited. This paper proposes a new language-driven approach for object manipulation with neural radiance fields through dataset updates. Specifically, to insert a new foreground object represented by a set of multi-view images into a background radiance field, we use a text-to-image diffusion model to learn and generate combined images that fuse the object of interest into the given background across views. These combined images are then used for refining the background radiance field so that we can render view-consistent images containing both the object and the background. To ensure view consistency, we propose a dataset updates strategy that prioritizes radiance field training with camera views close to the already-trained views prior to propagating the training to remaining views. We show that under the same dataset updates strategy, we can easily adapt our method for object insertion using data from text-to-3D models as well as object removal. Experimental results show that our method generates photorealistic images of the edited scenes, and outperforms state-of-the-art methods in 3D reconstruction and neural radiance field blending.
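A hedged sketch of the dataset-update ordering follows: starting from the already-trained views, repeatedly pick the untrained camera closest to the trained set so that fine-tuning propagates outward view by view. Plain camera-center distance is used here as an assumed proximity measure; the paper's criterion may differ.

```python
# A toy view-ordering sketch for propagating radiance-field updates from
# already-trained views to the remaining ones (assumed distance metric).
import numpy as np

def propagation_order(trained_centers, candidate_centers):
    trained = list(trained_centers)
    remaining = list(range(len(candidate_centers)))
    order = []
    while remaining:
        dists = [min(np.linalg.norm(candidate_centers[i] - t) for t in trained)
                 for i in remaining]
        nxt = remaining.pop(int(np.argmin(dists)))
        order.append(nxt)
        trained.append(candidate_centers[nxt])    # now counts as trained
    return order

rng = np.random.default_rng(1)
trained = rng.normal(size=(3, 3))       # camera centers of already-trained views
candidates = rng.normal(size=(10, 3))   # views still to be fine-tuned
print(propagation_order(trained, candidates))
```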
https://arxiv.org/abs/2309.11281
Coarse architectural models are often generated at scales ranging from individual buildings to scenes for downstream applications such as Digital Twin City, Metaverse, LODs, etc. Such piece-wise planar models can be abstracted as twins from 3D dense reconstructions. However, these models typically lack realistic texture relative to the real building or scene, making them unsuitable for vivid display or direct reference. In this paper, we present TwinTex, the first automatic texture mapping framework to generate a photo-realistic texture for a piece-wise planar proxy. Our method addresses most challenges occurring in such twin texture generation. Specifically, for each primitive plane, we first select a small set of photos with greedy heuristics considering photometric quality, perspective quality and facade texture completeness. Then, different levels of line features (LoLs) are extracted from the set of selected photos to generate guidance for later steps. With LoLs, we employ optimization algorithms to align texture with geometry from local to global. Finally, we fine-tune a diffusion model with a multi-mask initialization component and a new dataset to inpaint the missing region. Experimental results on many buildings, indoor scenes and man-made objects of varying complexity demonstrate the generalization ability of our algorithm. Our approach surpasses state-of-the-art texture mapping methods in terms of high-fidelity quality and reaches a human-expert production level with much less effort. Project page: https://vcc.tech/research/2023/TwinTex.
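The greedy per-plane photo selection can be pictured with the schematic below, which repeatedly adds the photo with the best combined score of photometric quality, perspective quality, and coverage gain; the scoring terms, weights, and data layout are stand-ins for the paper's heuristics.

```python
# A hedged greedy-selection sketch, assuming precomputed per-photo quality scores
# and a boolean coverage array over facade cells (not TwinTex's actual heuristics).
import numpy as np

def greedy_select(photos, max_photos=5, w=(0.4, 0.3, 0.3)):
    """photos: list of dicts with 'photometric' and 'perspective' in [0, 1]
    and 'coverage' as a boolean array over facade cells."""
    selected, covered = [], None
    for _ in range(max_photos):
        best, best_score = None, 0.0
        for i, p in enumerate(photos):
            if i in selected:
                continue
            gain = p["coverage"].mean() if covered is None else \
                   (p["coverage"] & ~covered).mean()      # completeness gain
            score = w[0] * p["photometric"] + w[1] * p["perspective"] + w[2] * gain
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        selected.append(best)
        covered = photos[best]["coverage"] if covered is None else \
                  covered | photos[best]["coverage"]
    return selected

rng = np.random.default_rng(2)
photos = [{"photometric": rng.random(), "perspective": rng.random(),
           "coverage": rng.random(100) > 0.6} for _ in range(12)]
print(greedy_select(photos))
```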
https://arxiv.org/abs/2309.11258
This paper proposes a shape anchor guided learning strategy (AncLearn) for robust holistic indoor scene understanding. We observe that the search space constructed by current methods for proposal feature grouping and instance point sampling often introduces massive noise to instance detection and mesh reconstruction. Accordingly, we develop AncLearn to generate anchors that dynamically fit instance surfaces to (i) unmix noise and target-related features for offering reliable proposals at the detection stage, and (ii) reduce outliers in object point sampling for directly providing well-structured geometry priors without segmentation during reconstruction. We embed AncLearn into a reconstruction-from-detection learning system (AncRec) to generate high-quality semantic scene models in a purely instance-oriented manner. Experiments conducted on the challenging ScanNetv2 dataset demonstrate that our shape anchor-based method consistently achieves state-of-the-art performance in terms of 3D object detection, layout estimation, and shape reconstruction. The code will be available at this https URL.
https://arxiv.org/abs/2309.11133
Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures.
https://arxiv.org/abs/2309.11081