The advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on these reasoning capabilities for improved planning behavior is challenging, since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new benchmark that challenges a model's true 3D situational awareness with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
https://arxiv.org/abs/2405.01533
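The sparse-query lifting described above can be illustrated with a minimal cross-attention sketch in plain Python. All shapes and values here are hypothetical toys; the actual OmniDrive architecture operates on learned multi-view image features and 3D position-aware queries.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, features):
    """Each sparse 3D query attends over all multi-view 2D feature tokens,
    compressing them into len(queries) output tokens for the LLM."""
    out = []
    for q in queries:
        scores = softmax([sum(qi * fi for qi, fi in zip(q, f)) for f in features])
        # weighted sum of the feature tokens under the attention weights
        token = [sum(w * f[d] for w, f in zip(scores, features))
                 for d in range(len(features[0]))]
        out.append(token)
    return out

# toy: 3 sparse queries compress 6 image tokens (dim 4) into 3 condensed tokens
queries = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]]
features = [[float(i == j) for j in range(4)] for i in range(4)] + [[0.5] * 4] * 2
tokens = cross_attend(queries, features)
```

The point of the sketch is the compression: however many image tokens come in, the LLM only ever sees a fixed, small number of query outputs.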
Computer-aided segmentation methods can assist medical personnel in improving diagnostic outcomes. While recent advancements like UNet and its variants have shown promise, they face a critical challenge: balancing accuracy with computational efficiency. Shallow encoder architectures in UNets often struggle to capture crucial spatial features, leading to inaccurate and sparse segmentation. To address this limitation, we propose a novel \underline{P}rogressive \underline{A}ttention based \underline{M}obile \underline{UNet} (\underline{PAM-UNet}) architecture. The inverted residual (IR) blocks in PAM-UNet help maintain a lightweight framework, while layerwise \textit{Progressive Luong Attention} ($\mathcal{PLA}$) promotes precise segmentation by directing attention toward regions of interest during synthesis. Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a dice score of 82.87, while requiring only 1.32 GFLOPs on the Liver Tumor Segmentation Benchmark (LiTS) 2017 dataset. These results highlight the importance of developing efficient segmentation models to accelerate the adoption of AI in clinical practice.
https://arxiv.org/abs/2405.01503
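Luong-style attention, the building block behind the $\mathcal{PLA}$ module, can be sketched as follows. This is the generic dot-product variant in plain Python; the layerwise progressive wiring of PAM-UNet is not reproduced here.

```python
import math

def luong_dot_attention(query, keys):
    """Luong-style dot-product attention: scores are plain dot products
    between the query and each key, softmax-normalized, and the context
    vector is the score-weighted sum of the keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(query))]
    return weights, context

# toy: a decoder feature attends over two encoder features
weights, context = luong_dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In a UNet-style decoder, the context vector would then gate the upsampled features toward regions of interest during synthesis.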
Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate directly on a real world recorded hand-object dataset with unseen objects. We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.
https://arxiv.org/abs/2405.01353
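A minimal sketch of optimisation-free multi-view fusion in the spirit of SVHO: per-view predictions are simply combined, with no cross-view optimisation. The occupancy-probability representation and the assumption that predictions are already in a shared world frame are illustrative simplifications, not the paper's actual pipeline.

```python
def fuse_views(per_view_probs):
    """Fuse per-view occupancy predictions (assumed already expressed in a
    shared world frame) by simple averaging, with no optimisation step."""
    n = len(per_view_probs)
    return [sum(view[i] for view in per_view_probs) / n
            for i in range(len(per_view_probs[0]))]

view1 = [0.9, 0.1, 0.6]
view2 = [0.7, 0.3, 0.0]   # this view is occluded at the last cell
fused = fuse_views([view1, view2])
```

Even this crude average shows why extra views help: a cell occluded in one view can still be recovered from another.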
Edge computing allows artificial intelligence and machine learning models to be deployed on edge devices, where they can learn from local data and collaborate to form a global model. Federated learning (FL) is a distributed machine learning technique that facilitates this process while preserving data privacy. However, FL also faces challenges such as high computational and communication costs on resource-constrained devices, and poor generalization performance due to the heterogeneity of data across edge clients and the presence of out-of-distribution data. In this paper, we propose Gradient-Congruity Guided Federated Sparse Training (FedSGC), a novel method that integrates dynamic sparse training and gradient congruity inspection into the federated learning framework to address these issues. Our method leverages the idea that neurons whose associated gradients conflict in direction with the global model carry irrelevant or less generalizable information for other clients, and can be pruned during the sparse training process. Conversely, neurons whose associated gradients are consistent in direction with the global model can be grown with higher priority. In this way, FedSGC greatly reduces local computation and communication overheads while, at the same time, enhancing the generalization abilities of FL. We evaluate our method on challenging non-i.i.d. settings and show that it achieves competitive accuracy with state-of-the-art FL methods across various scenarios while minimizing computation and communication costs.
https://arxiv.org/abs/2405.01189
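The prune/grow rule can be sketched as a simple per-parameter congruity score. This is an illustrative reading of the idea, not the authors' exact criterion: the score is positive when local and global gradients point the same way and negative when they conflict.

```python
def congruity_mask(local_grad, global_grad, keep_fraction=0.5):
    """Score each parameter by gradient congruity (product of local and
    global gradient components). The most conflicting parameters are
    pruned; the most consistent ones are kept/grown."""
    scores = [lg * gg for lg, gg in zip(local_grad, global_grad)]
    k = max(1, int(len(scores) * keep_fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]

# params 0 and 3 agree in sign with the global gradient, 1 conflicts
mask = congruity_mask([0.5, -0.2, 0.1, -0.4], [0.4, 0.3, 0.2, -0.5],
                      keep_fraction=0.5)
```

Only masked-in parameters would then be trained and communicated, which is where the computation and communication savings come from.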
Hyperspectral Imaging (HSI) serves as an important technique in remote sensing. However, high dimensionality and data volume typically pose significant computational challenges. Band selection is essential for reducing spectral redundancy in hyperspectral imagery while retaining intrinsic critical information. In this work, we propose a novel hyperspectral band selection model by decomposing the data into a low-rank and smooth component and a sparse one. In particular, we develop a generalized 3D total variation (G3DTV) by applying the $\ell_1^p$-norm to derivatives to preserve spatial-spectral smoothness. By employing the alternating direction method of multipliers (ADMM), we derive an efficient algorithm, where the tensor low-rankness is implied by the tensor CUR decomposition. We demonstrate the effectiveness of the proposed approach through comparisons with various other state-of-the-art band selection techniques using two benchmark real-world datasets. In addition, we provide practical guidelines for parameter selection in both noise-free and noisy scenarios.
https://arxiv.org/abs/2405.00951
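In such low-rank-plus-sparse decompositions, the sparse component is typically updated by the $\ell_1$ proximal operator inside each ADMM iteration. A minimal sketch of that operator follows; the paper's $\ell_1^p$-norm applied to derivatives generalizes this and is not reproduced here.

```python
def soft_threshold(x, tau):
    """Proximal operator of the l1 norm: shrink each entry toward zero by
    tau, zeroing small entries. This is the sparsity-promoting update
    used inside ADMM-type solvers."""
    return [max(abs(v) - tau, 0.0) * (1 if v > 0 else -1 if v < 0 else 0)
            for v in x]

s = soft_threshold([3.0, -0.5, 0.2, -2.0], 1.0)
```

Entries smaller in magnitude than the threshold are set exactly to zero, which is what makes the recovered component genuinely sparse rather than merely small.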
Multi-agent reinforcement learning (MARL) algorithms often struggle to find strategies close to the Pareto-optimal Nash equilibrium, owing largely to the lack of efficient exploration. The problem is exacerbated in sparse-reward settings, due to the larger variance exhibited in policy learning. This paper introduces MESA, a novel meta-exploration method for cooperative multi-agent learning. It learns to explore by first identifying the agents' high-rewarding joint state-action subspace from training tasks and then learning a set of diverse exploration policies to "cover" the subspace. These trained exploration policies can be integrated with any off-policy MARL algorithm for test-time tasks. We first showcase MESA's advantage in a multi-step matrix game. Furthermore, experiments show that with learned exploration policies, MESA achieves significantly better performance in sparse-reward tasks in several multi-agent particle environments and multi-agent MuJoCo environments, and exhibits the ability to generalize to more challenging tasks at test time.
https://arxiv.org/abs/2405.00902
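The notion of a diverse policy set that "covers" a high-reward subspace can be illustrated with a greedy set-cover heuristic. This is purely illustrative: MESA trains its exploration policies with RL rather than selecting from fixed visitation sets as below.

```python
def select_diverse(policy_visits, k):
    """Greedy cover: repeatedly pick the policy whose visited states add
    the most new coverage of the high-reward subspace."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(policy_visits,
                   key=lambda p: len(policy_visits[p] - covered))
        chosen.append(best)
        covered |= policy_visits[best]
    return chosen, covered

# hypothetical visitation sets of three candidate exploration policies
visits = {"pi0": {1, 2, 3}, "pi1": {3, 4}, "pi2": {5, 6, 7, 8}}
chosen, covered = select_diverse(visits, 2)
```

Diversity matters here: the second pick is the policy that complements the first, not the second-best policy in isolation.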
Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser sampling at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which is fused with the implicit grid-based representation for radiance decoding, thereby supplying the stronger geometric information offered by an explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing densified Lidar points by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.
https://arxiv.org/abs/2405.00900
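An occlusion-aware use of accumulated Lidar can be sketched per ray: among all accumulated returns that project onto a pixel, trust only those near the closest surface. This is a simplified stand-in for the paper's robust supervision scheme, and the 0.5 m tolerance is a made-up value.

```python
def robust_ray_depth(depths, tol=0.5):
    """Depths (metres) from accumulated Lidar sweeps projecting onto one
    ray. Returns lying behind the nearest surface are treated as occluded
    leak-through and excluded from the supervision target."""
    near = min(depths)
    kept = [d for d in depths if d - near <= tol]
    return sum(kept) / len(kept)

# two returns from the visible wall, one leak-through from a far building
target = robust_ray_depth([10.2, 10.0, 25.0])
```

Without the occlusion test, the 25 m return would drag the depth target far behind the true surface and corrupt the supervision.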
Neural Radiance Fields (NeRF) have shown impressive results in 3D reconstruction and generating novel views. A key challenge within NeRF is the editing of reconstructed scenes, such as object removal, which requires maintaining consistency across multiple views and ensuring high-quality synthesised perspectives. Previous studies have incorporated depth priors, typically from LiDAR or sparse depth measurements provided by COLMAP, to improve the performance of object removal in NeRF. However, these methods are either costly or time-consuming. In this paper, we propose a novel approach that integrates monocular depth estimates with NeRF-based object removal models to significantly reduce time consumption and enhance the robustness and quality of scene generation and object removal. We conducted a thorough evaluation of COLMAP's dense depth reconstruction on the KITTI dataset to verify its accuracy in depth map generation. Our findings suggest that COLMAP can serve as an effective alternative to a ground truth depth map where such information is missing or costly to obtain. Additionally, we integrated various monocular depth estimation methods into the removal NeRF model, i.e., SpinNeRF, to assess their capacity to improve object removal performance. Our experimental results highlight the potential of monocular depth estimation to substantially improve NeRF applications.
https://arxiv.org/abs/2405.00630
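Monocular depth is typically only defined up to an affine ambiguity, so before it can stand in for COLMAP or LiDAR priors it is commonly aligned to whatever sparse metric depth is available. Below is a least-squares scale-and-shift fit, offered as a sketch of this standard alignment step rather than the paper's exact pipeline.

```python
def align_scale_shift(mono, sparse):
    """Fit scale s and shift t minimising sum((s*m + t - d)^2), so a
    relative monocular depth map agrees with sparse metric depths."""
    n = len(mono)
    sm = sum(mono)
    sd = sum(sparse)
    smm = sum(m * m for m in mono)
    smd = sum(m * d for m, d in zip(mono, sparse))
    det = n * smm - sm * sm          # normal-equation determinant
    s = (n * smd - sm * sd) / det
    t = (sd - s * sm) / n
    return s, t

# toy: the sparse depths are exactly 2*mono + 0.5
s, t = align_scale_shift([1.0, 2.0, 3.0], [2.5, 4.5, 6.5])
```

Once s and t are known, the full monocular depth map can be converted to metric depth and used for dense supervision.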
Activity and parameter sparsity are two standard methods of making neural networks computationally more efficient. Event-based architectures such as spiking neural networks (SNNs) naturally exhibit activity sparsity, and many methods exist to sparsify their connectivity by pruning weights. While the effect of weight pruning on feed-forward SNNs has been previously studied for computer vision tasks, the effects of pruning for complex sequence tasks like language modeling are less well studied since SNNs have traditionally struggled to achieve meaningful performance on these tasks. Using a recently published SNN-like architecture that works well on small-scale language modeling, we study the effects of weight pruning when combined with activity sparsity. Specifically, we study the trade-off between the multiplicative efficiency gains the combination affords and its effect on task performance for language modeling. To dissect the effects of the two sparsities, we conduct a comparative analysis between densely activated models and sparsely activated event-based models across varying degrees of connectivity sparsity. We demonstrate that sparse activity and sparse connectivity complement each other without a proportional drop in task performance for an event-based neural network trained on the Penn Treebank and WikiText-2 language modeling datasets. Our results suggest sparsely connected event-based neural networks are promising candidates for effective and efficient sequence modeling.
https://arxiv.org/abs/2405.00433
In many practical applications, it is often difficult and expensive to obtain large-scale labeled data to train state-of-the-art deep neural networks. Therefore, transferring the learned knowledge from a separate, labeled source domain to an unlabeled or sparsely labeled target domain becomes an appealing alternative. However, direct transfer often results in significant performance decay due to domain shift. Domain adaptation (DA) aims to address this problem by aligning the distributions between the source and target domains. Multi-source domain adaptation (MDA) is a powerful and practical extension in which the labeled data may be collected from multiple sources with different distributions. In this survey, we first define various MDA strategies. Then we systematically summarize and compare modern MDA methods in the deep learning era from different perspectives, followed by commonly used datasets and a brief benchmark. Finally, we discuss future research directions for MDA that are worth investigating.
https://arxiv.org/abs/2405.00749
We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on a single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. In contrast to previous LRMs that can only reconstruct objects, by predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large variations in scale and complexity. We show that our model can work on both object and scene captures by training it on Objaverse and RealEstate10K respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks. Our project webpage is available at: this https URL .
https://arxiv.org/abs/2404.19702
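The tokenization step can be sketched as plain patchification. This single-channel toy omits what GS-LRM actually concatenates per patch (multi-view RGB plus camera/ray information), but shows how an image becomes a token sequence for the transformer.

```python
def patchify(image, patch):
    """Split an H x W single-channel image into non-overlapping
    patch x patch tokens, each flattened to a vector (row-major scan)."""
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[i + di][j + dj]
                           for di in range(patch) for dj in range(patch)])
    return tokens

# toy 4x4 image with pixel values 0..15, split into four 2x2 tokens
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(img, 2)
```

Decoding then runs in reverse: each output token is unflattened back to its patch location, yielding the per-pixel Gaussian parameters.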
In recent years, zero-shot learning has attracted the focus of many researchers due to its flexibility and generality. Many approaches have been proposed to achieve zero-shot classification of point clouds for 3D object understanding, following the schema of CLIP. However, in the real world, point clouds can be extremely sparse, dramatically limiting the effectiveness of 3D point cloud encoders and resulting in misalignment between point cloud features and text embeddings. To adapt point cloud encoders to extremely sparse point clouds without re-running the pre-training procedure, which could be time-consuming and expensive, we propose in this work an unsupervised model adaptation approach to enhance the point cloud encoder for extremely sparse point clouds. We propose a novel fused-cross attention layer that expands the pre-trained self-attention layer with additional learnable tokens and attention blocks, which effectively modifies the point cloud features while maintaining the alignment between point cloud features and text embeddings. We also propose a complementary learning-based self-distillation schema that encourages the modified features to be pulled apart from the irrelevant text embeddings without overfitting the feature space to the observed text embeddings. Extensive experiments demonstrate that the proposed approach effectively increases the zero-shot capability on extremely sparse point clouds and outperforms other state-of-the-art model adaptation approaches.
https://arxiv.org/abs/2404.19639
In this work, we present X-Diffusion, a cross-sectional diffusion model tailored for Magnetic Resonance Imaging (MRI) data. X-Diffusion is capable of generating the entire MRI volume from just a single MRI slice, or optionally from a few slices, setting new benchmarks in the precision of synthesized MRIs from extremely sparse observations. The uniqueness lies in the novel view-conditional training and inference of X-Diffusion on MRI volumes, allowing for generalized MRI learning. Our evaluations span both brain tumour MRIs from the BRATS dataset and full-body MRIs from the UK Biobank dataset. Utilizing the paired pre-registered Dual-energy X-ray Absorptiometry (DXA) and MRI modalities in the UK Biobank dataset, X-Diffusion is able to generate a detailed 3D MRI volume from a single full-body DXA. Remarkably, the resultant MRIs not only stand out in precision on unseen examples (surpassing state-of-the-art results by large margins) but also flawlessly retain essential features of the original MRI, including tumour profiles, spine curvature, brain volume, and beyond. Furthermore, the trained X-Diffusion model attains out-of-domain generalization (e.g., generating knee MRIs even though it is trained on brains). The code is available on the project website this https URL .
https://arxiv.org/abs/2404.19604
While camera-based capture systems remain the gold standard for recording human motion, learning-based tracking systems based on sparse wearable sensors are gaining popularity. Most commonly, they use inertial sensors, whose propensity for drift and jitter has so far limited tracking accuracy. In this paper, we propose Ultra Inertial Poser, a novel 3D full-body pose estimation method that constrains drift and jitter in inertial tracking via inter-sensor distances. We estimate these distances across sparse sensor setups using a lightweight embedded tracker that augments inexpensive off-the-shelf 6D inertial measurement units with ultra-wideband radio-based ranging, dynamically and without the need for stationary reference anchors. Our method then fuses these inter-sensor distances with the 3D states estimated from each sensor. Our graph-based machine learning model processes the 3D states and distances to estimate a person's 3D full-body pose and translation. To train our model, we synthesize inertial measurements and distance estimates from the motion capture database AMASS. For evaluation, we contribute a novel motion dataset of 10 participants who performed 25 motion types, captured by 6 wearable IMU+UWB trackers and an optical motion capture system, totaling 200 minutes of synchronized sensor data (UIP-DB). Our extensive experiments show state-of-the-art performance for our method over PIP and TIP, reducing position error from $13.62$ to $10.65cm$ ($22\%$ better) and lowering jitter from $1.56$ to $0.055km/s^3$ (a reduction of $97\%$).
https://arxiv.org/abs/2404.19541
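How inter-sensor ranges can constrain drifting position estimates is sketched below as a simple correction step that moves a pair of estimates until their distance matches the UWB measurement. This is illustrative only; the paper fuses the 3D states and distances with a learned graph model rather than this geometric projection.

```python
import math

def refine_with_ranges(positions, ranges, step=0.5):
    """One correction step: move each pair of sensor position estimates
    symmetrically so their distance approaches the measured UWB range."""
    pos = [list(p) for p in positions]
    for (i, j), r in ranges.items():
        d = math.dist(pos[i], pos[j])
        if d == 0:
            continue
        err = (d - r) / d  # > 0: estimates are too far apart, pull together
        for k in range(len(pos[i])):
            delta = step * err * (pos[j][k] - pos[i][k]) / 2
            pos[i][k] += delta
            pos[j][k] -= delta
    return pos

# two drifted 2D estimates 2 m apart; the UWB range says 1 m
drifted = [[0.0, 0.0], [2.0, 0.0]]
refined = refine_with_ranges(drifted, {(0, 1): 1.0}, step=1.0)
```

With a full step the pair lands exactly on the measured distance, which is the drift-limiting effect the ranges provide.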
Edge vision systems combining sensing and embedded processing promise low-latency, decentralized, and energy-efficient solutions that forgo reliance on the cloud. As opposed to conventional frame-based vision sensors, event-based cameras deliver a microsecond-scale temporal resolution with sparse information encoding, thereby outlining new opportunities for edge vision systems. However, mainstream algorithms for frame-based vision, which mostly rely on convolutional neural networks (CNNs), can hardly exploit the advantages of event-based vision as they are typically optimized for dense matrix-vector multiplications. While event-driven graph neural networks (GNNs) have recently emerged as a promising solution for sparse event-based vision, their irregular structure is a challenge that currently hinders the design of efficient hardware accelerators. In this paper, we propose EvGNN, the first event-driven GNN accelerator for low-footprint, ultra-low-latency, and high-accuracy edge vision with event-based cameras. It relies on three central ideas: (i) directed dynamic graphs exploiting single-hop nodes with edge-free storage, (ii) event queues for the efficient identification of local neighbors within a spatiotemporally decoupled search range, and (iii) a novel layer-parallel processing scheme enabling the low-latency execution of multi-layer GNNs. We deployed EvGNN on a Xilinx KV260 Ultrascale+ MPSoC platform and benchmarked it on the N-CARS dataset for car recognition, demonstrating a classification accuracy of 87.8% and an average latency per event of 16$\mu$s, thereby enabling real-time, microsecond-resolution event-based vision at the edge.
https://arxiv.org/abs/2404.19489
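Idea (ii), event queues for neighbour lookup, can be sketched as scanning a bounded queue of recent events with decoupled spatial and temporal ranges. Coordinates, units, and the queue length below are hypothetical.

```python
from collections import deque

def neighbors(queue, event, radius, window):
    """Scan the recent-event queue for spatiotemporal neighbours of a new
    event: within `radius` pixels (L-inf) and `window` microseconds."""
    x, y, t = event
    out = []
    for (qx, qy, qt) in queue:
        if t - qt > window:
            continue  # temporally decoupled: too old
        if max(abs(qx - x), abs(qy - y)) <= radius:
            out.append((qx, qy, qt))
    return out

q = deque(maxlen=16)  # bounded queue keeps memory constant
for ev in [(10, 10, 0), (11, 10, 5), (40, 40, 6), (10, 11, 30)]:
    q.append(ev)
near = neighbors(q, (10, 10, 32), radius=2, window=10)
```

The bounded queue gives constant memory per region, and the edges of the dynamic graph are recomputed from it on the fly, matching the edge-free storage of idea (i).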
The main function of depth completion is to compensate for the insufficient and unpredictable number of sparse depth measurements from hardware sensors. However, existing research on depth completion assumes that the sparsity -- the number of points or LiDAR lines -- is fixed for training and testing. Hence, completion performance drops severely when the number of sparse depths changes significantly. To address this issue, we propose the sparsity-adaptive depth refinement (SDR) framework, which refines monocular depth estimates using sparse depth points. For SDR, we propose the masked spatial propagation network (MSPN) to perform SDR with a varying number of sparse depths effectively by gradually propagating sparse depth information throughout the entire depth map. Experimental results demonstrate that MSPN achieves state-of-the-art performance on both SDR and conventional depth completion scenarios.
https://arxiv.org/abs/2404.19294
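Gradual propagation of sparse depth under a validity mask, the core idea behind MSPN, can be sketched on a toy grid. Plain neighbour averaging stands in for the network's learned, masked propagation weights.

```python
def propagate(depth, valid, iters):
    """Iteratively spread sparse depth values into empty cells by
    averaging valid 4-neighbours, growing the valid mask each pass."""
    h, w = len(depth), len(depth[0])
    for _ in range(iters):
        new_d = [row[:] for row in depth]
        new_v = [row[:] for row in valid]
        for i in range(h):
            for j in range(w):
                if valid[i][j]:
                    continue  # measured cells are kept as-is
                vals = [depth[i + di][j + dj]
                        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= i + di < h and 0 <= j + dj < w
                        and valid[i + di][j + dj]]
                if vals:
                    new_d[i][j] = sum(vals) / len(vals)
                    new_v[i][j] = 1
        depth, valid = new_d, new_v
    return depth, valid

d = [[0.0] * 3 for _ in range(3)]
v = [[0] * 3 for _ in range(3)]
d[1][1], v[1][1] = 5.0, 1   # a single sparse measurement
d2, v2 = propagate(d, v, iters=2)
```

Because each pass only consumes currently valid cells, the same procedure works unchanged whether one point or thousands are given, which is the sparsity-adaptive property.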
We introduce a lightweight and accurate architecture for resource-efficient visual correspondence. Our method, dubbed XFeat (Accelerated Features), revisits fundamental design choices in convolutional neural networks for detecting, extracting, and matching local features. Our new model satisfies a critical need for fast and robust algorithms suitable to resource-limited devices. In particular, accurate image matching requires sufficiently large image resolutions - for this reason, we keep the resolution as large as possible while limiting the number of channels in the network. Besides, our model is designed to offer the choice of matching at the sparse or semi-dense levels, each of which may be more suitable for different downstream applications, such as visual navigation and augmented reality. Our model is the first to offer semi-dense matching efficiently, leveraging a novel match refinement module that relies on coarse local descriptors. XFeat is versatile and hardware-independent, surpassing current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy, proven in pose estimation and visual localization. We showcase it running in real-time on an inexpensive laptop CPU without specialized hardware optimizations. Code and weights are available at this http URL.
https://arxiv.org/abs/2404.19174
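Sparse-level matching of extracted descriptors is commonly done with mutual nearest neighbours; a minimal sketch follows. This is generic similarity matching, not XFeat's coarse-descriptor refinement module.

```python
def mutual_nn_matches(desc_a, desc_b):
    """Keep pairs (i, j) where descriptor i's best match in B is j AND
    j's best match in A is i (mutual nearest neighbours under
    dot-product similarity)."""
    def best(d, pool):
        return max(range(len(pool)),
                   key=lambda k: sum(x * y for x, y in zip(d, pool[k])))
    a2b = [best(d, desc_b) for d in desc_a]
    b2a = [best(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a2b) if b2a[j] == i]

A = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
B = [[0.0, 1.0], [1.0, 0.0]]
matches = mutual_nn_matches(A, B)
```

The mutuality check discards ambiguous descriptors (like the third one in A), trading a little recall for much higher precision, which matters for downstream pose estimation.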
We propose a system for visual scene analysis and recognition based on encoding the sparse, latent feature-representation of an image into a high-dimensional vector that is subsequently factorized to parse scene content. The sparse feature representation is learned from image statistics via convolutional sparse coding, while scene parsing is performed by a resonator network. The integration of sparse coding with the resonator network increases the capacity of distributed representations and reduces collisions in the combinatorial search space during factorization. We find that for this problem the resonator network is capable of fast and accurate vector factorization, and we develop a confidence-based metric that assists in tracking the convergence of the resonator network.
https://arxiv.org/abs/2404.19126
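The factorization step can be sketched with bipolar vector-symbolic vectors: binding is elementwise multiplication (its own inverse for ±1 vectors), and the resonator alternates unbinding one factor with cleanup against that factor's codebook. This toy uses an initialisation that converges deterministically; the full resonator network also handles superposition initial states and larger search spaces.

```python
import random

def bind(u, v):
    """Elementwise multiply: binds two bipolar vectors, and also unbinds,
    since (+-1) * (+-1) squared is 1."""
    return [a * b for a, b in zip(u, v)]

def cleanup(x, codebook):
    """Project onto the closest codebook vector (max dot product)."""
    return max(codebook, key=lambda c: sum(a * b for a, b in zip(c, x)))

def resonator(composite, book_a, book_b, iters=10):
    """Alternate: unbind the current estimate of one factor from the
    composite, clean up against the other factor's codebook, repeat."""
    a_hat, b_hat = book_a[0], book_b[0]
    for _ in range(iters):
        a_hat = cleanup(bind(composite, b_hat), book_a)
        b_hat = cleanup(bind(composite, a_hat), book_b)
    return a_hat, b_hat

random.seed(0)
dim = 64
book_a = [[random.choice((-1, 1)) for _ in range(dim)] for _ in range(4)]
book_b = [[random.choice((-1, 1)) for _ in range(dim)] for _ in range(4)]
composite = bind(book_a[1], book_b[0])   # scene vector to be parsed
a_hat, b_hat = resonator(composite, book_a, book_b)
```

The appeal is that the factorization searches the 4 x 4 combinatorial space without enumerating it; the convolutional sparse code in the paper supplies the composite vector that gets parsed this way.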
4D time-space reconstruction of dynamic events or deforming objects using X-ray computed tomography (CT) is an extremely ill-posed inverse problem. Existing approaches assume that the object remains static for the duration of several tens or hundreds of X-ray projection measurement images (reconstruction of consecutive limited-angle CT scans). However, this is an unrealistic assumption for many in-situ experiments that causes spurious artifacts and inaccurate morphological reconstructions of the object. To solve this problem, we propose to perform a 4D time-space reconstruction using a distributed implicit neural representation (DINR) network that is trained using a novel distributed stochastic training algorithm. Our DINR network learns to reconstruct the object at its output by iterative optimization of its network parameters such that the measured projection images best match the output of the CT forward measurement model. We use a continuous time and space forward measurement model that is a function of the DINR outputs at a sparsely sampled set of continuous valued object coordinates. Unlike existing state-of-the-art neural representation architectures that forward and back propagate through dense voxel grids that sample the object's entire time-space coordinates, we only propagate through the DINR at a small subset of object coordinates in each iteration resulting in an order-of-magnitude reduction in memory and compute for training. DINR leverages distributed computation across several compute nodes and GPUs to produce high-fidelity 4D time-space reconstructions even for extremely large CT data sizes. We use both simulated parallel-beam and experimental cone-beam X-ray CT datasets to demonstrate the superior performance of our approach.
https://arxiv.org/abs/2404.19075
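The memory saving comes from evaluating the network only at a small random batch of continuous coordinates each step, instead of a dense voxel grid. A sketch of the sampling step follows; the batch size and bounds are made up.

```python
import random

def sample_coordinates(bounds, n):
    """Draw a small batch of continuous (x, y, z, t) coordinates; each
    training step propagates through the network only at these points
    rather than over a dense time-space grid."""
    return [tuple(random.uniform(lo, hi) for lo, hi in bounds)
            for _ in range(n)]

random.seed(1)
bounds = [(0.0, 1.0)] * 3 + [(0.0, 10.0)]   # x, y, z in [0,1], t in [0,10]
batch = sample_coordinates(bounds, 256)     # vs. e.g. 256**3 dense samples
```

Because the coordinates are continuous-valued, they plug directly into the continuous forward measurement model, and different compute nodes can draw disjoint batches, which is what makes the distributed training scale.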
Neural Radiance Fields (NeRF) show impressive performance in photo-realistic free-view rendering of scenes. Recent improvements on the NeRF such as TensoRF and ZipNeRF employ explicit models for faster optimization and rendering, as compared to the NeRF that employs an implicit representation. However, both implicit and explicit radiance fields require dense sampling of images in the given scene. Their performance degrades significantly when only a sparse set of views is available. Researchers find that supervising the depth estimated by a radiance field helps train it effectively with fewer views. The depth supervision is obtained either using classical approaches or neural networks pre-trained on a large dataset. While the former may provide only sparse supervision, the latter may suffer from generalization issues. As opposed to the earlier approaches, we seek to learn the depth supervision by designing augmented models and training them along with the main radiance field. Further, we aim to design a framework of regularizations that can work across different implicit and explicit radiance fields. We observe that certain features of these radiance field models overfit to the observed images in the sparse-input scenario. Our key finding is that reducing the capability of the radiance fields with respect to positional encoding, the number of decomposed tensor components or the size of the hash table, constrains the model to learn simpler solutions, which estimate better depth in certain regions. By designing augmented models based on such reduced capabilities, we obtain better depth supervision for the main radiance field. We achieve state-of-the-art view-synthesis performance with sparse input views on popular datasets containing forward-facing and 360$^\circ$ scenes by employing the above regularizations.
https://arxiv.org/abs/2404.19015
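One of the capability reductions, shrinking the positional encoding, can be shown directly: fewer frequency bands mean the field can only fit smoother functions. Below is a standard NeRF-style encoding of a scalar coordinate; the specific frequency counts are illustrative.

```python
import math

def positional_encoding(x, n_freqs):
    """NeRF-style positional encoding of a scalar coordinate: sin/cos
    pairs at geometrically increasing frequencies. Fewer frequencies
    restrict the model to smoother solutions, the reduced capability
    used to build the augmented models."""
    feats = []
    for k in range(n_freqs):
        w = (2 ** k) * math.pi
        feats.extend((math.sin(w * x), math.cos(w * x)))
    return feats

full = positional_encoding(0.25, 10)     # main radiance field
reduced = positional_encoding(0.25, 3)   # augmented, lower-capacity model
```

The reduced encoding is a strict prefix of the full one, so the augmented model sees the same low frequencies but cannot overfit high-frequency detail, which is why its depth serves as useful supervision in sparse-input regions.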