Large Reconstruction Models have made significant strides in the realm of automated 3D content generation from single or multiple input images. Despite their success, these models often produce 3D meshes with geometric inaccuracies, stemming from the inherent challenges of deducing 3D shapes solely from image data. In this work, we introduce a novel framework, the Large Image and Point Cloud Alignment Model (LAM3D), which utilizes 3D point cloud data to enhance the fidelity of generated 3D meshes. Our methodology begins with the development of a point-cloud-based network that effectively generates precise and meaningful latent tri-planes, laying the groundwork for accurate 3D mesh reconstruction. Building upon this, our Image-Point-Cloud Feature Alignment technique processes a single input image, aligning its features to the latent tri-planes to imbue them with robust 3D information. This process not only enriches the image features but also facilitates the production of high-fidelity 3D meshes without the need for multi-view input, significantly reducing geometric distortions. Our approach achieves state-of-the-art high-fidelity 3D mesh reconstruction from a single image in just 6 seconds, and experiments on various datasets demonstrate its effectiveness.
https://arxiv.org/abs/2405.15622
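To make the alignment idea concrete, here is a minimal sketch of mapping single-image features onto point-cloud-derived latent tri-planes with a simple MSE objective; the module shapes and the choice of loss are illustrative assumptions, not LAM3D's actual architecture.

```python
import torch
import torch.nn as nn

class ImageToTriplane(nn.Module):
    """Projects an image feature map to three latent planes (XY, XZ, YZ)."""
    def __init__(self, img_dim=256, plane_dim=32):
        super().__init__()
        self.proj = nn.Conv2d(img_dim, 3 * plane_dim, kernel_size=1)
        self.plane_dim = plane_dim

    def forward(self, img_feat):                  # (B, img_dim, R, R)
        x = self.proj(img_feat)                   # (B, 3*C, R, R)
        B, _, R, _ = x.shape
        return x.view(B, 3, self.plane_dim, R, R)

def alignment_loss(img_planes, pc_planes):
    """Pull image-predicted planes toward frozen point-cloud latent planes."""
    return nn.functional.mse_loss(img_planes, pc_planes.detach())

# Usage: pc_planes come from the pre-trained point-cloud branch (stage one);
# the image encoder is assumed to emit features at the plane resolution.
img_feat = torch.randn(2, 256, 64, 64)
pc_planes = torch.randn(2, 3, 32, 64, 64)
loss = alignment_loss(ImageToTriplane()(img_feat), pc_planes)
loss.backward()
```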
Monocular camera calibration is a key precondition for numerous 3D vision applications. Despite considerable advancements, existing methods often hinge on specific assumptions and struggle to generalize across varied real-world scenarios, and their performance is limited by insufficient training data. Recently, diffusion models trained on expansive datasets have been confirmed to maintain the capability to generate diverse, high-quality images. This success suggests a strong potential of the models to effectively understand varied visual information. In this work, we leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation. Specifically, we reformulate the problem of estimating the four degrees of freedom (4-DoF) of camera intrinsic parameters as a dense incident map generation task. The map details the angle of incidence for each pixel in the RGB image, and its format aligns well with the paradigm of diffusion models. The camera intrinsics can then be derived from the incident map with a simple, non-learning RANSAC algorithm during inference. Moreover, to further enhance the performance, we jointly estimate a depth map to provide extra geometric information for the incident map estimation. Extensive experiments on multiple testing datasets demonstrate that our model achieves state-of-the-art performance, gaining up to a 40% reduction in prediction errors. Besides, the experiments also show that the precise camera intrinsics and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image.
https://arxiv.org/abs/2405.15619
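The incident-map formulation lends itself to a simple closed-form recovery. Below is a minimal sketch of RANSAC-style intrinsics estimation under a pinhole model, where the ray through pixel (u, v) is proportional to [(u-cx)/fx, (v-cy)/fy, 1]; the threshold and iteration budget are assumptions rather than the paper's exact estimator.

```python
import numpy as np

def ransac_1d(x, y, iters=500, thresh=1.0, rng=np.random.default_rng(0)):
    """Robustly fit y = a*x + b from 2-point samples; returns (a, b)."""
    best, best_inl = (1.0, 0.0), -1
    for _ in range(iters):
        i, j = rng.choice(len(x), size=2, replace=False)
        if abs(x[i] - x[j]) < 1e-9:
            continue
        a = (y[i] - y[j]) / (x[i] - x[j])
        b = y[i] - a * x[i]
        inl = np.sum(np.abs(a * x + b - y) < thresh)
        if inl > best_inl:
            best, best_inl = (a, b), inl
    return best

def intrinsics_from_incident_map(dirs):
    """dirs: (H, W, 3) unit incidence directions per pixel -> fx, fy, cx, cy."""
    H, W, _ = dirs.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    xz = (dirs[..., 0] / dirs[..., 2]).ravel()   # equals (u - cx) / fx
    yz = (dirs[..., 1] / dirs[..., 2]).ravel()   # equals (v - cy) / fy
    fx, cx = ransac_1d(xz, u.ravel())            # u = fx * xz + cx
    fy, cy = ransac_1d(yz, v.ravel())            # v = fy * yz + cy
    return fx, fy, cx, cy

# Sanity check on a synthetic incident map.
H, W, fx, fy, cx, cy = 48, 64, 500.0, 520.0, 32.0, 24.0
v, u = np.mgrid[0:H, 0:W].astype(np.float64)
d = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
d /= np.linalg.norm(d, axis=-1, keepdims=True)
print(intrinsics_from_incident_map(d))  # ~ (500, 520, 32, 24)
```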
Machine unlearning is a promising paradigm for removing unwanted data samples from a trained model, towards ensuring compliance with privacy regulations and limiting harmful biases. Although unlearning has been shown in, e.g., classification and recommendation systems, its potential in medical image-to-image translation, specifically in image reconstruction, has not been thoroughly investigated. This paper shows that machine unlearning is possible in MRI tasks and has the potential to benefit bias removal. We set up a protocol to study how much shared knowledge exists between datasets of different organs, allowing us to effectively quantify the effect of unlearning. Our study reveals that combining training data can lead to hallucinations and reduced image quality in the reconstructed data. We use unlearning to remove hallucinations as a proxy exemplar of undesired data removal. Indeed, we show that machine unlearning is possible without full retraining. Furthermore, our observations indicate that maintaining high performance is feasible even when using only a subset of the retain data. We have made our code publicly accessible.
https://arxiv.org/abs/2405.15517
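The abstract does not spell out the unlearning procedure, so the sketch below shows one common baseline that the described setup could use: gradient ascent on the forget set balanced by ordinary descent on a retain subset.

```python
import torch

def unlearning_step(model, loss_fn, forget_batch, retain_batch,
                    optimizer, ascent_weight=0.5):
    """One update that unlearns forget_batch while preserving retain_batch."""
    optimizer.zero_grad()
    x_f, y_f = forget_batch
    x_r, y_r = retain_batch
    # Negative sign: ascend the loss on data the model should forget.
    loss = loss_fn(model(x_r), y_r) - ascent_weight * loss_fn(model(x_f), y_f)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a toy reconstruction model (MSE, as in MRI reconstruction).
model = torch.nn.Conv2d(1, 1, 3, padding=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = torch.nn.MSELoss()
forget = (torch.randn(4, 1, 32, 32), torch.randn(4, 1, 32, 32))
retain = (torch.randn(4, 1, 32, 32), torch.randn(4, 1, 32, 32))
unlearning_step(model, mse, forget, retain, opt)
```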
We propose a volumetric representation based on primitives to model scattering and emissive media. Accurate scene representations enabling efficient rendering are essential for many computer graphics applications. General and unified representations that can handle surface- and volume-based content simultaneously, allowing for physically accurate modeling, remain a research challenge. Inspired by recent methods for scene reconstruction that leverage mixtures of 3D Gaussians to model radiance fields, we formalize and generalize the modeling of scattering and emissive media using mixtures of simple kernel-based volumetric primitives. We introduce closed-form solutions for transmittance and free-flight distance sampling for 3D Gaussian kernels, and propose several optimizations to use our method within any off-the-shelf volumetric path tracer, leveraging ray tracing to efficiently query the medium. We demonstrate our method as an alternative to other forms of volume modeling (e.g. voxel grid-based representations) for forward and inverse rendering of scattering media. Furthermore, we adapt our method to the problem of radiance field optimization and rendering, and demonstrate comparable performance to the state of the art, while providing additional flexibility in terms of performance and usability.
https://arxiv.org/abs/2405.15425
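For the isotropic special case, the closed-form transmittance mentioned above can be written down directly with the error function. The sketch below works through that one-Gaussian case and checks it against numerical quadrature; the paper itself covers general Gaussian kernels.

```python
import numpy as np
from math import erf, sqrt, pi

def gaussian_transmittance(o, d, mu, s, sigma0, t):
    """T(t) = exp(-optical depth) from ray origin o to distance t (d unit),
    for density sigma0 * exp(-|x - mu|^2 / (2 s^2))."""
    om = mu - o
    tc = float(np.dot(d, om))              # parameter of closest approach
    b2 = float(np.dot(om, om)) - tc * tc   # squared distance at closest approach
    # Line integral of a 1D Gaussian segment, expressed via erf.
    scale = sigma0 * np.exp(-b2 / (2 * s * s)) * s * sqrt(pi / 2)
    tau = scale * (erf((t - tc) / (s * sqrt(2))) - erf(-tc / (s * sqrt(2))))
    return np.exp(-tau)

# Check against a fine numerical reference (rectangle rule).
o, d = np.zeros(3), np.array([0.0, 0.0, 1.0])
mu, s, sigma0, t = np.array([0.1, 0.0, 2.0]), 0.3, 5.0, 4.0
ts = np.linspace(0, t, 20000)
dens = sigma0 * np.exp(-np.sum((o + ts[:, None] * d - mu) ** 2, 1) / (2 * s**2))
print(gaussian_transmittance(o, d, mu, s, sigma0, t))  # closed form
print(np.exp(-np.sum(dens) * (ts[1] - ts[0])))         # numerical reference
```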
Continuum robots have emerged as a promising technology in the medical field due to their potential to access deep-seated locations of the human body with low surgical trauma. When deriving physics-based models for these robots, evaluating the models poses a significant challenge due to the difficulty in accurately measuring their intricate shapes. In this work, we present an optimization-based 3D shape registration algorithm for estimation of the backbone shape of slender continuum robots as part of a photogrammetric measurement. Our approach estimates the backbone by optimally matching a parametric three-dimensional curve to images of the robot. Since we incorporate an iterative closest point algorithm into our method, we do not need prior knowledge of the robot's position within the respective images. In our experiments with artificial and real images of a concentric tube continuum robot, we found an average maximum deviation of the reconstruction of 0.665 mm from simulation data and 0.939 mm from manual measurements. These results show that our algorithm is well capable of producing high-accuracy positional data from images of continuum robots.
https://arxiv.org/abs/2405.15336
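To illustrate the curve registration idea in its simplest form, the sketch below fits a cubic Bezier curve to unordered points by alternating closest-point assignment with linear least squares, in the spirit of the ICP step described above; the curve model and iteration scheme are simplifying assumptions.

```python
import numpy as np

def bernstein(t):
    """Cubic Bernstein basis, t: (M,) -> (M, 4)."""
    t = t[:, None]
    return np.hstack([(1-t)**3, 3*t*(1-t)**2, 3*t**2*(1-t), t**3])

def fit_bezier_icp(points, iters=20, samples=200):
    """points: (N, D) observations; returns (4, D) control points."""
    # Initialize control points along the diagonal of the data's bounding box.
    lo, hi = points.min(0), points.max(0)
    ctrl = np.linspace(0, 1, 4)[:, None] * (hi - lo) + lo
    ts = np.linspace(0, 1, samples)
    for _ in range(iters):
        curve = bernstein(ts) @ ctrl                     # (samples, D)
        # Closest-point step: assign each observation a curve parameter.
        d2 = ((points[:, None, :] - curve[None]) ** 2).sum(-1)
        t_assigned = ts[np.argmin(d2, axis=1)]           # (N,)
        # Least-squares step: re-fit control points at assigned parameters.
        B = bernstein(t_assigned)                        # (N, 4)
        ctrl, *_ = np.linalg.lstsq(B, points, rcond=None)
    return ctrl

# Toy usage: noisy samples from an arc, no known ordering or pose prior.
rng = np.random.default_rng(0)
theta = rng.uniform(0, np.pi / 2, 300)
pts = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.normal(size=(300, 2))
print(fit_bezier_icp(pts))
```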
This research introduces a Positive Reconstruction Framework based on positive psychology theory. Overcoming negative thoughts can be challenging; our objective is to address and reframe them through a positive reinterpretation. To tackle this challenge, a two-fold approach is necessary: identifying cognitive distortions and suggesting a positively reframed alternative while preserving the original thought's meaning. Recent studies have investigated the application of Natural Language Processing (NLP) models in English for each stage of this process. In this study, we emphasize the theoretical foundation for the Positive Reconstruction Framework, grounded in broaden-and-build theory. We provide a shared corpus in Mandarin containing 4001 instances for detecting cognitive distortions and 1900 instances for positive reconstruction. Leveraging recent NLP techniques, including transfer learning, fine-tuning pretrained networks, and prompt engineering, we demonstrate the effectiveness of automated tools for both tasks. In summary, our study contributes to multilingual positive reconstruction, highlighting the effectiveness of NLP in cognitive distortion detection and positive reconstruction.
https://arxiv.org/abs/2405.15334
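A minimal sketch of the two-stage pipeline (distortion detection, then positive reframing) using Hugging Face pipelines; the model paths and label set below are placeholders, not the paper's released checkpoints.

```python
from transformers import pipeline

# Stage 1: classify whether a thought contains a cognitive distortion.
detector = pipeline("text-classification",
                    model="path/to/distortion-classifier")  # hypothetical

# Stage 2: generate a positively reframed alternative that keeps the meaning.
reframer = pipeline("text2text-generation",
                    model="path/to/positive-reframer")      # hypothetical

def positive_reconstruction(thought: str) -> str:
    label = detector(thought)[0]["label"]
    if label == "no_distortion":        # label set is an assumption
        return thought
    prompt = f"Reframe positively while preserving meaning: {thought}"
    return reframer(prompt, max_new_tokens=64)[0]["generated_text"]
```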
Solving 3D medical inverse problems such as image restoration and reconstruction is crucial in the modern medical field. However, the curse of dimensionality in 3D medical data leads mainstream volume-wise methods to suffer from high resource consumption and challenges models to successfully capture the natural distribution, resulting in inevitable volume inconsistency and artifacts. Some recent works attempt to simplify generation in the latent space but lack the capability to efficiently model intricate image details. To address these limitations, we present Blaze3DM, a novel approach that enables fast and high-fidelity generation by integrating a compact triplane neural field and a powerful diffusion model. Technically, Blaze3DM begins by optimizing data-dependent triplane embeddings and a shared decoder simultaneously, reconstructing each triplane back to the corresponding 3D volume. To further enhance 3D consistency, we introduce a lightweight 3D-aware module to model the correlation of the three vertical planes. A diffusion model is then trained on the latent triplane embeddings to achieve both unconditional and conditional triplane generation; the generated triplanes are finally decoded into volumes of arbitrary size. Extensive experiments on zero-shot 3D medical inverse problem solving, including sparse-view CT, limited-angle CT, compressed-sensing MRI, and MRI isotropic super-resolution, demonstrate that Blaze3DM not only achieves state-of-the-art performance but also markedly improves computational efficiency over existing methods (22~40x faster than previous work).
https://arxiv.org/abs/2405.15241
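The core read operation behind a triplane decoder is sampling three feature planes at each 3D query point. The sketch below shows one common formulation (summing bilinearly sampled plane features); Blaze3DM's actual decoder may differ.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """planes: (B, 3, C, R, R); xyz: (B, N, 3) in [-1, 1] -> (B, N, C)."""
    B, _, C, R, _ = planes.shape
    coords = [xyz[..., [0, 1]], xyz[..., [0, 2]], xyz[..., [1, 2]]]  # xy,xz,yz
    feat = 0
    for i, uv in enumerate(coords):
        grid = uv.view(B, -1, 1, 2)                       # (B, N, 1, 2)
        f = F.grid_sample(planes[:, i], grid,
                          mode="bilinear", align_corners=False)
        feat = feat + f.squeeze(-1).transpose(1, 2)       # (B, N, C)
    return feat

# Usage: decode features for arbitrary query points, hence volumes of any size.
planes = torch.randn(2, 3, 16, 64, 64)
xyz = torch.rand(2, 1024, 3) * 2 - 1
print(sample_triplane(planes, xyz).shape)  # torch.Size([2, 1024, 16])
```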
Understanding the hidden mechanisms behind human visual perception is a fundamental quest in neuroscience and underpins a wide variety of critical applications, e.g. clinical diagnosis. To that end, investigating the neural responses of human mind activities, such as functional Magnetic Resonance Imaging (fMRI), has been a significant research vehicle. However, analyzing fMRI signals is challenging, costly, daunting, and demanding of professional training. Despite remarkable progress in artificial intelligence (AI) based fMRI analysis, existing solutions are limited and far from being clinically meaningful. In this context, we leap forward to demonstrate how AI can go beyond the current state of the art by decoding fMRI into visually plausible 3D visuals, enabling automatic clinical analysis of fMRI data, even without healthcare professionals. Innovatively, we reformulate the task of analyzing fMRI data as a conditional 3D scene reconstruction problem. We design a novel cross-modal 3D scene representation learning method, Brain3D, that takes as input the fMRI data of a subject who was presented with a 2D object image, and yields as output the corresponding 3D object visuals. Importantly, we show that in simulated scenarios our AI agent captures the distinct functionalities of each region of the human vision system as well as their intricate interplay relationships, aligning remarkably with the established discoveries of neuroscience. Non-expert diagnoses indicate that Brain3D can successfully identify disordered brain regions, such as V1, V2, V3, V4, and the medial temporal lobe (MTL), within the human visual system. We also present results in a cross-modal 3D visual construction setting, showcasing the perception quality of our 3D scene generation.
https://arxiv.org/abs/2405.15239
This work introduces Neural Elevation Models (NEMos), which adapt Neural Radiance Fields to a 2.5D continuous and differentiable terrain model. In contrast to traditional terrain representations such as digital elevation models, NEMos can be readily generated from imagery, a low-cost data source, and provide a lightweight representation of terrain through an implicit continuous and differentiable height field. We propose a novel method for jointly training a height field and radiance field within a NeRF framework, leveraging quantile regression. Additionally, we introduce a path planning algorithm that performs gradient-based optimization of a continuous cost function for minimizing distance, slope changes, and control effort, enabled by the differentiability of the height field. We perform experiments on simulated and real-world terrain imagery, demonstrating the ability of NEMos to generate high-quality reconstructions and produce smoother paths than discrete path planning methods. Future work will explore the incorporation of features and semantics into the height field, creating a generalized terrain model.
https://arxiv.org/abs/2405.15227
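Because the height field is differentiable, path planning reduces to gradient descent on waypoint coordinates. The sketch below illustrates this with an analytic stand-in height field; the cost weights and the exact slope/control terms are assumptions.

```python
import torch

def height(xy):                       # analytic stand-in for a trained NEMo
    return 0.3 * torch.sin(3 * xy[..., 0]) * torch.cos(2 * xy[..., 1])

def plan(start, goal, n=32, steps=500, w_slope=5.0, w_ctrl=1.0):
    # Interior waypoints are the free parameters; endpoints stay fixed.
    t = torch.linspace(0, 1, n)[1:-1, None]
    pts = (start + t * (goal - start)).clone().requires_grad_(True)
    opt = torch.optim.Adam([pts], lr=1e-2)
    for _ in range(steps):
        path = torch.cat([start[None], pts, goal[None]])       # (n, 2)
        z = height(path)
        p3d = torch.cat([path, z[:, None]], dim=1)             # lift to 3D
        seg = p3d[1:] - p3d[:-1]
        length = seg.norm(dim=1).sum()
        slope = (z[1:] - z[:-1]).abs().sum()                   # steepness proxy
        ctrl = (path[2:] - 2 * path[1:-1] + path[:-2]).norm(dim=1).sum()
        loss = length + w_slope * slope + w_ctrl * ctrl
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.cat([start[None], pts.detach(), goal[None]])

path = plan(torch.tensor([0.0, 0.0]), torch.tensor([2.0, 2.0]))
```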
Reverse engineering CAD models from raw geometry is a classic but challenging research problem. In particular, reconstructing the CAD modeling sequence from point clouds provides great interpretability and convenience for editing. To make progress on this problem, we introduce geometric guidance into the reconstruction network. Our proposed model, PS-CAD, reconstructs the CAD modeling sequence one step at a time. At each step, we provide two forms of geometric guidance. First, we provide the geometry of surfaces where the current reconstruction differs from the complete model as a point cloud. This helps the framework to focus on regions that still need work. Second, we use geometric analysis to extract a set of planar prompts that correspond to candidate surfaces where a CAD extrusion step could be started. Our framework has three major components. Geometric guidance computation extracts the two types of geometric guidance. Single-step reconstruction computes a single candidate CAD modeling step for each provided prompt. Single-step selection selects among the candidate CAD modeling steps. The process continues until the reconstruction is completed. Our quantitative results show a significant improvement across all metrics. For example, on the DeepCAD dataset, PS-CAD improves upon the best published SOTA method by reducing the geometry errors (CD and HD) by 10%, and the structural error (ECD metric) by about 15%.
https://arxiv.org/abs/2405.15188
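As a rough illustration of the single-step selection stage, the sketch below scores candidate partial reconstructions against the target geometry with a symmetric Chamfer distance; the abstract does not specify the actual selection criterion, so this metric is an assumed stand-in.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N,3) and b (M,3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean() + d2.min(0).mean()

def select_step(candidate_points, target_points):
    """Return the index of the candidate reconstruction closest to the target."""
    scores = [chamfer(c, target_points) for c in candidate_points]
    return int(np.argmin(scores))

# Toy usage: candidate 1 is a perturbed copy of the target, candidate 0 isn't.
rng = np.random.default_rng(0)
target = rng.normal(size=(500, 3))
cands = [rng.normal(size=(500, 3)), target + 0.01 * rng.normal(size=(500, 3))]
print(select_step(cands, target))  # -> 1
```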
3D Gaussian Splatting (3DGS) has become an emerging research focus in the fields of 3D scene reconstruction and novel view synthesis. Given that training a 3DGS requires significant time and computational cost, it is crucial to protect the copyright, integrity, and privacy of such 3D assets. Steganography, as a crucial technique for encrypted transmission and copyright protection, has been extensively studied. However, it still lacks in-depth exploration targeted at 3DGS. Unlike its predecessor NeRF, 3DGS possesses two distinct features: 1) explicit 3D representation; and 2) real-time rendering speeds. These characteristics result in the 3DGS point cloud files being public and transparent, with each Gaussian point having a clear physical significance. Therefore, ensuring the security and fidelity of the original 3D scene while embedding information into the 3DGS point cloud files is an extremely challenging task. To solve the above-mentioned issue, we first propose a steganography framework for 3DGS, dubbed GS-Hider, which can embed 3D scenes and images into original GS point clouds in an invisible manner and accurately extract the hidden messages. Specifically, we design a coupled secured feature attribute to replace the original 3DGS's spherical harmonics coefficients and then use a scene decoder and a message decoder to disentangle the original RGB scene and the hidden message. Extensive experiments demonstrate that the proposed GS-Hider can effectively conceal multimodal messages without compromising rendering quality and possesses exceptional security, robustness, capacity, and flexibility. Our project is available at: this https URL.
https://arxiv.org/abs/2405.15118
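A minimal sketch of the coupled-feature idea: each Gaussian carries one learned feature vector in place of its spherical harmonics coefficients, and two decoders disentangle it into scene color and hidden-message color. The decoder widths and activations are assumptions.

```python
import torch
import torch.nn as nn

class CoupledDecoders(nn.Module):
    def __init__(self, feat_dim=48, hidden=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3), nn.Sigmoid())
        self.scene_decoder = mlp()     # decodes the original RGB scene
        self.message_decoder = mlp()   # decodes the hidden scene / image

    def forward(self, gaussian_feats):            # (N, feat_dim) per Gaussian
        return (self.scene_decoder(gaussian_feats),
                self.message_decoder(gaussian_feats))

feats = torch.randn(10000, 48)                    # replaces SH coefficients
scene_rgb, hidden_rgb = CoupledDecoders()(feats)
```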
Purpose: To develop and evaluate a deep learning model for general accelerated MRI reconstruction. Materials and Methods: This retrospective study built a magnetic resonance image processing transformer (MR-IPT) which includes multi-head-tails and a single shared window transformer main body. Three variants of MR-IPT with different transformer structures were implemented to guide the design of our MR-IPT model. Pre-trained on the MRI set of RadImageNet, comprising 672,675 images with multiple anatomy categories, the model was further migrated to and evaluated on the fastMRI knee dataset, with 25,012 images, for downstream reconstruction tasks. We performed comparison studies with three CNN-based conventional networks in zero- and few-shot learning scenarios. A transfer learning process was conducted on both MR-IPT and the CNN networks to further validate the generalizability of MR-IPT. To study the stability of model performance, we evaluated our model with various downstream dataset sizes ranging from 10 to 2500 images. Results: The MR-IPT model provided superior performance in multiple downstream tasks compared to conventional CNN networks. MR-IPT achieved a PSNR/SSIM of 26.521/0.6102 (4-fold) and 24.861/0.4996 (8-fold) in 10-epoch learning, surpassing UNet128 at 25.056/0.5832 (4-fold) and 22.984/0.4637 (8-fold). With the same large-scale pre-training, MR-IPT provided a 5% performance boost over UNet128 in zero-shot learning at 8-fold acceleration and 3% at 4-fold. Conclusion: The MR-IPT framework benefits from its transformer-based structure and large-scale pre-training, and can serve as a solid backbone for other downstream tasks with zero- and few-shot learning.
https://arxiv.org/abs/2405.15098
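The multi-head-tails design can be summarized as task-specific heads and tails around one shared transformer body, so the body can be pre-trained once and reused. The sketch below shows that wiring with placeholder layer sizes and task names; it is not the published MR-IPT architecture.

```python
import torch
import torch.nn as nn

class MultiHeadTail(nn.Module):
    def __init__(self, tasks=("4x", "8x"), dim=64):
        super().__init__()
        self.heads = nn.ModuleDict({t: nn.Conv2d(1, dim, 3, padding=1)
                                    for t in tasks})           # per-task head
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                         batch_first=True)
        self.body = nn.TransformerEncoder(enc, num_layers=4)   # shared body
        self.tails = nn.ModuleDict({t: nn.Conv2d(dim, 1, 3, padding=1)
                                    for t in tasks})           # per-task tail

    def forward(self, x, task):                    # x: (B, 1, H, W)
        f = self.heads[task](x)                    # (B, dim, H, W)
        B, C, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)      # (B, H*W, dim)
        tokens = self.body(tokens)
        f = tokens.transpose(1, 2).view(B, C, H, W)
        return self.tails[task](f)

out = MultiHeadTail()(torch.randn(1, 1, 32, 32), task="4x")
```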
The DreamerV3 agent recently demonstrated state-of-the-art performance in diverse domains, learning powerful world models in latent space using a pixel reconstruction loss. However, while the reconstruction loss is essential to Dreamer's performance, it also necessitates modeling unnecessary information. Consequently, Dreamer sometimes fails to perceive crucial elements that are necessary for task-solving when visual distractions are present in the observation, significantly limiting its potential. In this paper, we present MuDreamer, a robust reinforcement learning agent that builds upon the DreamerV3 algorithm by learning a predictive world model without the need for reconstructing input signals. Rather than relying on pixel reconstruction, hidden representations are instead learned by predicting the environment value function and previously selected actions. Similar to predictive self-supervised methods for images, we find that the use of batch normalization is crucial to prevent learning collapse. We also study the effect of KL balancing between model posterior and prior losses on convergence speed and learning stability. We evaluate MuDreamer on the commonly used DeepMind Visual Control Suite and demonstrate stronger robustness to visual distractions compared to DreamerV3 and other reconstruction-free approaches when the environment background is replaced with task-irrelevant real-world videos. Our method also achieves comparable performance on the Atari100k benchmark while benefiting from faster training.
https://arxiv.org/abs/2405.15083
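KL balancing, as studied above, mixes two stop-gradient variants of the posterior-prior KL so that the prior is trained toward the posterior more strongly than the reverse. A minimal sketch with categorical latents follows (the mechanism is as in DreamerV3; the mixing coefficient value is an assumption).

```python
import torch
import torch.distributions as D

def balanced_kl(post_logits, prior_logits, alpha=0.8):
    """alpha weights training of the prior toward the (stopped) posterior."""
    post = D.Categorical(logits=post_logits)
    prior = D.Categorical(logits=prior_logits)
    post_sg = D.Categorical(logits=post_logits.detach())
    prior_sg = D.Categorical(logits=prior_logits.detach())
    # Train the prior against a fixed posterior, and lightly regularize the
    # posterior toward a fixed prior.
    return (alpha * D.kl_divergence(post_sg, prior)
            + (1 - alpha) * D.kl_divergence(post, prior_sg)).mean()

# Usage: a batch of 16 latents, each with 32 categorical variables x 32 classes.
loss = balanced_kl(torch.randn(16, 32, 32), torch.randn(16, 32, 32))
```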
Surface Electromyography (sEMG) is a non-invasive signal that is used in the recognition of hand movement patterns, the diagnosis of diseases, and the robust control of prostheses. Despite the remarkable success of recent end-to-end Deep Learning approaches, they are still limited by the need for large amounts of labeled data. To alleviate the requirement for big data, researchers utilize Feature Engineering, which involves decomposing the sEMG signal into several spatial, temporal, and frequency features. In this paper, we propose utilizing a feature-imitating network (FIN) for closed-form temporal feature learning over a 300 ms signal window on Ninapro DB2, and applying it to the task of recognizing 17 hand movements. We implement a lightweight LSTM-FIN network to imitate four standard temporal features (entropy, root mean square, variance, simple square integral). We then explore transfer learning capabilities by applying the pre-trained LSTM-FIN for tuning to a downstream hand movement recognition task. We observed that the LSTM network can achieve up to 99% R2 accuracy in feature reconstruction and 80% accuracy in hand movement recognition. Our results also showed that the model can be robustly applied for both within- and cross-subject movement recognition, as well as in simulated low-latency environments. Overall, our work demonstrates the potential of the FIN modeling paradigm in data-scarce scenarios for sEMG signal processing.
https://arxiv.org/abs/2405.19356
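The four imitated temporal features have simple closed forms over a window. A minimal sketch follows; the histogram-based Shannon entropy is one common estimator and an assumption here, as the paper may use a different formulation.

```python
import numpy as np

def temporal_features(window, bins=32):
    """window: (T,) sEMG samples -> dict of the four imitated features."""
    rms = np.sqrt(np.mean(window ** 2))           # root mean square
    var = np.var(window)                          # variance
    ssi = np.sum(window ** 2)                     # simple square integral
    hist, _ = np.histogram(window, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))             # Shannon entropy (assumed)
    return {"entropy": entropy, "rms": rms, "variance": var, "ssi": ssi}

# Usage: a 300 ms window at a 2 kHz sampling rate gives 600 samples.
window = np.random.default_rng(0).normal(size=600)
print(temporal_features(window))
```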
Recently, point cloud processing and analysis have made great progress due to the development of 3D Transformers. However, existing 3D Transformer methods are usually computationally expensive and inefficient due to their huge and redundant attention maps. They also tend to be slow due to requiring time-consuming point cloud sampling and grouping processes. To address these issues, we propose an efficient point TransFormer with Dynamic Token Aggregating (DTA-Former) for point cloud representation and processing. Firstly, we propose an efficient Learnable Token Sparsification (LTS) block, which considers both local and global semantic information for the adaptive selection of key tokens. Secondly, to achieve feature aggregation for the sparsified tokens, we present the first Dynamic Token Aggregating (DTA) block in the 3D Transformer paradigm, providing our model with strong aggregated features while preventing information loss. After that, a dual-attention Transformer-based Global Feature Enhancement (GFE) block is used to improve the representation capability of the model. Equipped with the LTS, DTA, and GFE blocks, DTA-Former achieves excellent classification results via hierarchical feature learning. Lastly, a novel Iterative Token Reconstruction (ITR) block is introduced for dense prediction, whereby the semantic features of tokens and their semantic relationships are gradually optimized during iterative reconstruction. Based on ITR, we propose a new W-net architecture, which is more suitable for Transformer-based feature learning than the common U-net design. Extensive experiments demonstrate the superiority of our method. It achieves SOTA performance while running up to 30$\times$ faster than prior point Transformers on the ModelNet40, ShapeNet, and airborne MultiSpectral LiDAR (MS-LiDAR) datasets.
https://arxiv.org/abs/2405.15827
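A minimal sketch of learnable token sparsification: score each token from its own feature plus pooled global context, keep the top-k, and reweight by the scores so the scorer still receives gradient through the kept tokens. The scoring MLP and keep ratio are illustrative assumptions, not the exact LTS block.

```python
import torch
import torch.nn as nn

class TokenSparsify(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))

    def forward(self, tokens, keep_ratio=0.25):      # tokens: (B, N, C)
        B, N, C = tokens.shape
        g = tokens.mean(dim=1, keepdim=True).expand(-1, N, -1)      # global ctx
        s = self.score(torch.cat([tokens, g], dim=-1)).squeeze(-1)  # (B, N)
        k = max(1, int(N * keep_ratio))
        idx = s.topk(k, dim=1).indices                              # key tokens
        kept = torch.gather(tokens, 1,
                            idx.unsqueeze(-1).expand(-1, -1, C))
        # Weight by (sigmoid) scores so the selection stays trainable.
        return kept * torch.sigmoid(torch.gather(s, 1, idx)).unsqueeze(-1)

kept = TokenSparsify()(torch.randn(2, 1024, 64))   # -> (2, 256, 64)
```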
3D Transformers have achieved great success in point cloud understanding and representation. However, there is still considerable scope for further development of effective and efficient Transformers for large-scale LiDAR point cloud scene segmentation. This paper proposes a novel 3D Transformer framework, named 3D Learnable Supertoken Transformer (3DLST). The key contributions are summarized as follows. Firstly, we introduce the first Dynamic Supertoken Optimization (DSO) block for efficient token clustering and aggregating, where the learnable supertoken definition avoids the time-consuming pre-processing of traditional superpoint generation. Since the learnable supertokens can be dynamically optimized by multi-level deep features during network learning, they are tailored to semantic homogeneity-aware token clustering. Secondly, an efficient Cross-Attention-guided Upsampling (CAU) block is proposed for token reconstruction from the optimized supertokens. Thirdly, 3DLST is equipped with a novel W-net architecture instead of the common U-net design, which is more suitable for Transformer-based feature learning. The SOTA performance on three challenging LiDAR datasets (airborne MultiSpectral LiDAR (MS-LiDAR) (89.3% average F1 score), DALES (80.2% mIoU), and Toronto-3D (80.4% mIoU)) demonstrates the superiority of 3DLST and its strong adaptability to various LiDAR point cloud data (airborne MS-LiDAR, aerial LiDAR, and vehicle-mounted LiDAR data). Furthermore, 3DLST also achieves satisfactory results in terms of algorithm efficiency, running up to 5x faster than previous best-performing methods.
https://arxiv.org/abs/2405.15826
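The CAU idea can be pictured as dense per-point queries attending to the optimized supertokens to reconstruct point-wise features. A minimal sketch with a single attention layer standing in for the paper's block:

```python
import torch
import torch.nn as nn

class CrossAttentionUpsample(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.pos = nn.Linear(3, dim)          # embed raw point coordinates
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, xyz, supertokens):
        """xyz: (B, N, 3) dense points; supertokens: (B, S, dim)."""
        q = self.pos(xyz)                     # (B, N, dim) queries
        out, _ = self.attn(q, supertokens, supertokens)
        return out                            # per-point features, (B, N, dim)

feats = CrossAttentionUpsample()(torch.randn(2, 4096, 3),
                                 torch.randn(2, 64, 64))
```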
Event cameras offer promising advantages such as high dynamic range and low latency, making them well-suited for challenging lighting conditions and fast-moving scenarios. However, reconstructing 3D scenes from raw event streams is difficult because event data is sparse and does not carry absolute color information. To unlock their potential in 3D reconstruction, we propose the first event-based generalizable 3D reconstruction framework, called EvGGS, which reconstructs scenes as 3D Gaussians from only event input in a feedforward manner and can generalize to unseen cases without any retraining. This framework includes a depth estimation module, an intensity reconstruction module, and a Gaussian regression module. These submodules connect in a cascading manner, and we train them collaboratively with a designed joint loss so that they mutually promote one another. To facilitate related studies, we build a novel event-based 3D dataset with various material objects and calibrated labels of grayscale images, depth maps, camera poses, and silhouettes. Experiments show that jointly trained models significantly outperform those trained individually. Our approach outperforms all baselines in reconstruction quality and depth/intensity prediction, with satisfactory rendering speed.
https://arxiv.org/abs/2405.14959
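A minimal sketch of the collaborative training signal: the cascaded depth, intensity, and rendering objectives are combined in one joint loss. The weights and the L1/MSE choices are assumptions.

```python
import torch

def joint_loss(pred_depth, gt_depth, pred_intensity, gt_intensity,
               rendered, gt_image, w=(1.0, 1.0, 1.0)):
    """Sum the three cascaded objectives so all submodules train together."""
    l_depth = torch.nn.functional.l1_loss(pred_depth, gt_depth)
    l_intensity = torch.nn.functional.l1_loss(pred_intensity, gt_intensity)
    l_render = torch.nn.functional.mse_loss(rendered, gt_image)
    return w[0] * l_depth + w[1] * l_intensity + w[2] * l_render
```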
Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods that generate avatars for celebrities or fictional characters struggle with everyday people. Methods for faithful reconstruction typically require full-body images in controlled settings. What if a user could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar in return? The challenge is that such casual photo collections contain diverse poses, challenging viewpoints, cropped views, and occlusion (albeit with a consistent outfit, accessories, and hairstyle). We address this novel "Album2Human" task by developing PuzzleAvatar, a novel model that generates a faithful 3D avatar (in a canonical pose) from a personal OOTD album, while bypassing the challenging estimation of body and camera pose. To this end, we fine-tune a foundational vision-language model (VLM) on such photos, encoding the appearance, identity, garments, hairstyles, and accessories of a person into (separate) learned tokens and instilling these cues into the VLM. In effect, we exploit the learned tokens as "puzzle pieces" from which we assemble a faithful, personalized 3D avatar. Importantly, we can customize avatars by simply interchanging tokens. As a benchmark for this new task, we collect a new dataset, called PuzzleIOI, with 41 subjects in a total of nearly 1K OOTD configurations, in challenging partial photos with paired ground-truth 3D bodies. Evaluation shows that PuzzleAvatar not only achieves high reconstruction accuracy, outperforming TeCH and MVDreamBooth, but also offers unique scalability to album photos and strong robustness. Our model and data will be public.
https://arxiv.org/abs/2405.14869
Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups and significantly restricting their utility in the wild as well as for embodied AI applications. In this paper, we propose $\textbf{GCD}$, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input, and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained on synthetic multi-view video data only, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
https://arxiv.org/abs/2405.14868
Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet, the two problems have largely been approached independently, without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale, while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning about cameras and scenes. This work introduces Synergistic Camera and Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically, we design a Human-aware Metric SLAM that reconstructs metric-scale camera poses and scene point clouds using camera-frame HMR as a strong prior, addressing depth, scale, and dynamic ambiguities. Conditioned on the recovered dense scene, we further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by incorporating spatio-temporal coherency and dynamic scene constraints. Together, they lead to consistent reconstructions of camera trajectories, human meshes, and dense scene point clouds in a common world frame. Project page: this https URL
https://arxiv.org/abs/2405.14855