We develop the first (to the best of our knowledge) provably correct neural networks for a precise computational task, with the proof of correctness generated by an automated verification algorithm without any human input. Prior work on neural network verification has focused on partial specifications that, even when satisfied, are not sufficient to ensure that a neural network never makes errors. We focus on applying neural network verification to computational tasks with a precise notion of correctness, where a verifiably correct neural network provably solves the task at hand with no caveats. In particular, we develop an approach to train and verify the first provably correct neural networks for compressed sensing, i.e., recovering sparse vectors from a number of measurements smaller than the dimension of the vector. We show that for modest problem dimensions (up to 50), we can train neural networks that provably recover a sparse vector from linear and binarized linear measurements. Furthermore, we show that the complexity of the network (number of neurons/layers) can be adapted to the problem difficulty and solve problems where traditional compressed sensing methods are not known to provably work.
https://arxiv.org/abs/2405.04260
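The sparse-recovery setting the paper targets can be made concrete with a small numpy sketch. This is not the paper's verified neural network: it sets up the classical compressed-sensing problem (recover a k-sparse vector from m < n linear measurements) and solves it with ISTA (iterative soft-thresholding), a standard traditional baseline; the dimensions and the choice of a random Gaussian sensing matrix are illustrative assumptions.

```python
import numpy as np

def ista(A, y, lam=0.05, steps=500):
    """Iterative soft-thresholding (ISTA): a classical sparse-recovery baseline.
    Minimizes 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        g = x - (A.T @ (A @ x - y)) / L        # gradient step on the quadratic term
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(0)
n, m, k = 50, 25, 3                            # dimension, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
y = A @ x_true                                 # m < n linear measurements
x_hat = ista(A, y)
print(np.linalg.norm(x_hat - x_true))          # small recovery error
```

The paper's contribution is replacing such hand-derived solvers with a trained network whose correctness on the task is then verified automatically.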
Cross-modal knowledge transfer enhances point cloud representation learning in LiDAR semantic segmentation. Despite its potential, the \textit{weak teacher challenge} arises due to repetitive and non-diverse car camera images and sparse, inaccurate ground truth labels. To address this, we propose the Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces Patch-to-Point Multi-Stage Knowledge Distillation, transferring comprehensive knowledge from the Vision Foundation Model (VFM), extensively trained on diverse open-world images. This enables effective knowledge transfer to a lightweight student model across modalities. ELiTe employs Parameter-Efficient Fine-Tuning to strengthen the VFM teacher and expedite large-scale model training with minimal costs. Additionally, we introduce the Segment Anything Model based Pseudo-Label Generation approach to enhance low-quality image labels, facilitating robust semantic representations. Efficient knowledge transfer in ELiTe yields state-of-the-art results on the SemanticKITTI benchmark, outperforming real-time inference models. Our approach achieves this with significantly fewer parameters, confirming its effectiveness and efficiency.
https://arxiv.org/abs/2405.04121
Object detection plays a critical role in autonomous driving, where accurately and efficiently detecting objects in fast-moving scenes is crucial. Traditional frame-based cameras face challenges in balancing latency and bandwidth, necessitating innovative solutions. Event cameras have emerged as promising sensors for autonomous driving due to their low latency, high dynamic range, and low power consumption. However, effectively utilizing the asynchronous and sparse event data presents challenges, particularly in maintaining low latency and lightweight architectures for object detection. This paper provides an overview of object detection using event data in autonomous driving, showcasing the competitive benefits of event cameras.
https://arxiv.org/abs/2405.03995
3D object detection plays an important role in autonomous driving; however, its vulnerability to backdoor attacks has become evident. By injecting "triggers" to poison the training dataset, backdoor attacks manipulate the detector's prediction for inputs containing these triggers. Existing backdoor attacks against 3D object detection primarily poison 3D LiDAR signals, where large-sized 3D triggers are injected to ensure their visibility within the sparse 3D space, rendering them easy to detect and impractical in real-world scenarios. In this paper, we delve into the robustness of 3D object detection, exploring a new backdoor attack surface through 2D cameras. Given the prevalent adoption of camera and LiDAR signal fusion for high-fidelity 3D perception, we investigate the latent potential of camera signals to disrupt the process. Although the dense nature of camera signals enables the use of nearly imperceptible small-sized triggers to mislead 2D object detection, realizing 2D-oriented backdoor attacks against 3D object detection is non-trivial. The primary challenge emerges from the fusion process that transforms camera signals into a 3D space, compromising the association with the 2D trigger to the target output. To tackle this issue, we propose an innovative 2D-oriented backdoor attack against LiDAR-camera fusion methods for 3D object detection, named BadFusion, for preserving trigger effectiveness throughout the entire fusion process. The evaluation demonstrates the effectiveness of BadFusion, achieving a significantly higher attack success rate compared to existing 2D-oriented attacks.
https://arxiv.org/abs/2405.03884
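The basic poisoning primitive in a 2D-oriented backdoor attack is pasting a small trigger patch into camera images of the training set. The sketch below shows only this generic step; BadFusion's actual contribution, which is not reproduced here, is designing triggers that survive the LiDAR-camera fusion transform. Patch size and placement are illustrative assumptions.

```python
import numpy as np

def inject_trigger(image, trigger, x, y):
    """Paste a small 2D trigger patch into a camera image (generic poisoning
    step; not BadFusion's fusion-robust trigger design).
    image: (H, W, 3) uint8; trigger: (h, w, 3) uint8."""
    poisoned = image.copy()
    h, w = trigger.shape[:2]
    poisoned[y:y + h, x:x + w] = trigger
    return poisoned

image = np.zeros((64, 64, 3), dtype=np.uint8)
trigger = np.full((4, 4, 3), 255, dtype=np.uint8)    # small, dense-signal trigger
poisoned = inject_trigger(image, trigger, x=10, y=20)
print(int(poisoned.sum()))   # 4*4*3*255 = 12240
```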
Novel view synthesis from a sparse set of input images is a challenging problem of great practical interest, especially when camera poses are absent or inaccurate. Direct optimization of camera poses and usage of estimated depths in neural radiance field algorithms usually do not produce good results because of the coupling between poses and depths, and inaccuracies in monocular depth estimation. In this paper, we leverage the recent 3D Gaussian splatting method to develop a novel construct-and-optimize method for sparse view synthesis without camera poses. Specifically, we construct a solution progressively by using monocular depth and projecting pixels back into the 3D world. During construction, we optimize the solution by detecting 2D correspondences between training views and the corresponding rendered images. We develop a unified differentiable pipeline for camera registration and adjustment of both camera poses and depths, followed by back-projection. We also introduce a novel notion of an expected surface in Gaussian splatting, which is critical to our optimization. These steps enable a coarse solution, which can then be low-pass filtered and refined using standard optimization methods. We demonstrate results on the Tanks and Temples and Static Hikes datasets with as few as three widely-spaced views, showing significantly better quality than competing methods, including those with approximate camera pose information. Moreover, our results improve with more views and outperform previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset.
https://arxiv.org/abs/2405.03659
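The construction step above, projecting pixels back into the 3D world using monocular depth, is standard pinhole back-projection. A minimal numpy sketch (the intrinsics matrix and depth map here are illustrative assumptions, not values from the paper):

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map into camera-space 3D points (pinhole model).
    depth: (H, W) metric depth; K: 3x3 intrinsics. Returns (H*W, 3) points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T       # unit-depth rays: K^{-1} [u, v, 1]^T
    return rays * depth.reshape(-1, 1)    # scale each ray by its monocular depth

K = np.array([[500.0, 0.0, 32.0],
              [0.0, 500.0, 24.0],
              [0.0,   0.0,  1.0]])
depth = np.full((48, 64), 2.0)            # toy depth map: flat wall 2 m away
pts = backproject(depth, K)
print(pts.shape)  # (3072, 3)
```

In the paper's pipeline, points like these would seed the Gaussians, with poses and depths then adjusted via the differentiable 2D-correspondence objective.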
Existing neural field-based SLAM methods typically employ a single monolithic field as their scene representation. This prevents efficient incorporation of loop closure constraints and limits scalability. To address these shortcomings, we propose a neural mapping framework which anchors lightweight neural fields to the pose graph of a sparse visual SLAM system. Our approach shows the ability to integrate large-scale loop closures, while limiting necessary reintegration. Furthermore, we verify the scalability of our approach by demonstrating successful building-scale mapping taking multiple loop closures into account during the optimization, and show that our method outperforms existing state-of-the-art approaches on large scenes in terms of quality and runtime. Our code is available at this https URL.
https://arxiv.org/abs/2405.03633
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.
https://arxiv.org/abs/2405.03594
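SparseGPT itself is a second-order one-shot pruning method; as a much simpler stand-in, the sketch below uses plain magnitude pruning to show what "70% sparsity" means for a weight matrix, producing both the pruned weights and the binary mask that sparse pretraining would then keep fixed. The matrix size and the magnitude criterion are illustrative assumptions.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.7):
    """One-shot magnitude pruning: zero out the smallest-|w| fraction of
    weights. (A simple stand-in for SparseGPT, which uses second-order
    information to choose which weights to remove.)"""
    k = int(W.size * sparsity)
    thresh = np.partition(np.abs(W).ravel(), k)[k]   # k-th smallest magnitude
    mask = np.abs(W) >= thresh                       # keep the largest weights
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))                  # stand-in dense layer
W_sparse, mask = magnitude_prune(W, 0.7)
print(1.0 - mask.mean())   # achieved sparsity, approximately 0.7
```

Inference engines such as DeepSparse then exploit the fixed zero pattern to skip computation.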
Continuous Conditional Generative Modeling (CCGM) aims to estimate the distribution of high-dimensional data, typically images, conditioned on scalar continuous variables known as regression labels. While Continuous conditional Generative Adversarial Networks (CcGANs) were initially designed for this task, their adversarial training mechanism remains vulnerable to extremely sparse or imbalanced data, resulting in suboptimal outcomes. To enhance the quality of generated images, a promising alternative is to replace CcGANs with Conditional Diffusion Models (CDMs), renowned for their stable training process and ability to produce more realistic images. However, existing CDMs encounter challenges when applied to CCGM tasks due to several limitations such as inadequate U-Net architectures and deficient model fitting mechanisms for handling regression labels. In this paper, we introduce Continuous Conditional Diffusion Models (CCDMs), the first CDM designed specifically for the CCGM task. CCDMs address the limitations of existing CDMs by introducing specially designed conditional diffusion processes, a modified denoising U-Net with a custom-made conditioning mechanism, a novel hard vicinal loss for model fitting, and an efficient conditional sampling procedure. With comprehensive experiments on four datasets with varying resolutions ranging from 64x64 to 192x192, we demonstrate the superiority of the proposed CCDM over state-of-the-art CCGM models, establishing new benchmarks in CCGM. Extensive ablation studies validate the model design and implementation configuration of the proposed CCDM. Our code is publicly available at this https URL.
https://arxiv.org/abs/2405.03546
Accurate classification of medical images is essential for modern diagnostics. Deep learning advancements led clinicians to increasingly use sophisticated models to make faster and more accurate decisions, sometimes replacing human judgment. However, model development is costly and repetitive. Neural Architecture Search (NAS) provides solutions by automating the design of deep learning architectures. This paper presents ZO-DARTS+, a differentiable NAS algorithm that improves search efficiency through a novel method of generating sparse probabilities by bi-level optimization. Experiments on five public medical datasets show that ZO-DARTS+ matches the accuracy of state-of-the-art solutions while reducing search times by up to three times.
https://arxiv.org/abs/2405.03462
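The paper's bi-level scheme for generating sparse architecture probabilities is not spelled out in the abstract; as a hedged illustration of the underlying idea, sparsemax (Martins & Astudillo, 2016) is one standard way to turn architecture logits into a probability vector with exact zeros, unlike softmax, which keeps every candidate operation active.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of the logits onto the probability
    simplex. Unlike softmax, it can assign exactly zero probability."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv        # candidates still in the support
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max      # threshold for the projection
    return np.maximum(z - tau, 0.0)

logits = np.array([2.0, 1.2, 0.1, -1.0])    # scores for 4 candidate operations
p = sparsemax(logits)
print(p)        # [0.9 0.1 0.  0. ] -- trailing entries are exactly zero
print(p.sum())  # 1.0
```

Zeroed-out operations can be dropped from the supernet entirely, which is where the search-time savings come from.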
Building accurate maps is a key building block to enable reliable localization, planning, and navigation of autonomous vehicles. We propose a novel approach for building accurate maps of dynamic environments utilizing a sequence of LiDAR scans. To this end, we propose encoding the 4D scene into a novel spatio-temporal implicit neural map representation by fitting a time-dependent truncated signed distance function to each point. Using our representation, we extract the static map by filtering the dynamic parts. Our neural representation is based on sparse feature grids, a globally shared decoder, and time-dependent basis functions, which we jointly optimize in an unsupervised fashion. To learn this representation from a sequence of LiDAR scans, we design a simple yet efficient loss function to supervise the map optimization in a piecewise way. We evaluate our approach on various scenes containing moving objects in terms of the reconstruction quality of static maps and the segmentation of dynamic point clouds. The experimental results demonstrate that our method is capable of removing the dynamic part of the input point clouds while reconstructing accurate and complete 3D maps, outperforming several state-of-the-art methods. Codes are available at: this https URL
https://arxiv.org/abs/2405.03388
Reinforcement learning (RL) presents a promising framework to learn policies through environment interaction, but often requires an infeasible amount of interaction data to solve complex tasks from sparse rewards. One direction includes augmenting RL with offline data demonstrating desired tasks, but past work often requires large amounts of high-quality demonstration data that are difficult to obtain, especially for domains such as robotics. Our approach consists of a reverse curriculum followed by a forward curriculum. Unique to our approach compared to past work is the ability to efficiently leverage more than one demonstration via a per-demonstration reverse curriculum generated via state resets. The result of our reverse curriculum is an initial policy that performs well on a narrow initial state distribution and helps overcome difficult exploration problems. A forward curriculum is then used to accelerate the training of the initial policy to perform well on the full initial state distribution of the task and improve demonstration and sample efficiency. We show how the combination of a reverse curriculum and forward curriculum in our method, RFCL, enables significant improvements in demonstration and sample efficiency compared against various state-of-the-art learning-from-demonstration baselines, even solving previously unsolvable tasks that require high precision and control.
https://arxiv.org/abs/2405.03379
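The per-demonstration reverse curriculum can be sketched as a small scheduler: episodes reset to states drawn from late in a demonstration, and the reset window moves toward earlier states as the agent's success rate rises. This is a hedged sketch of the idea only; the window size, promotion threshold, and success-tracking scheme are illustrative assumptions, not RFCL's actual hyperparameters.

```python
import random

class ReverseCurriculum:
    """Per-demonstration reverse curriculum via state resets (a sketch of the
    idea in RFCL): start episodes near the end of a demo, and expand the
    reset window toward earlier states once the agent succeeds reliably."""
    def __init__(self, demo_states, window=3, promote_at=0.8):
        self.demo = demo_states                # environment states from one demo
        self.start = len(demo_states) - 1      # begin just before task success
        self.window, self.promote_at = window, promote_at
        self.results = []

    def sample_reset_state(self):
        lo = max(0, self.start - self.window + 1)
        return random.choice(self.demo[lo:self.start + 1])

    def report(self, success):
        self.results.append(success)
        recent = self.results[-20:]
        if len(recent) == 20 and sum(recent) / 20 >= self.promote_at:
            self.start = max(0, self.start - 1)   # move toward earlier states
            self.results = []

demo = list(range(100))                 # stand-in states: positions along a demo
cur = ReverseCurriculum(demo)
for _ in range(40):
    cur.sample_reset_state()
    cur.report(success=True)            # pretend the agent always succeeds
print(cur.start)                        # reset point has moved earlier than 99
```

The forward curriculum would then widen the initial-state distribution from this narrow, well-handled region toward the task's full distribution.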
Reconstructing human dynamic vision from brain activity is a challenging task with great scientific significance. The difficulty stems from two primary issues: (1) vision-processing mechanisms in the brain are highly intricate and not fully revealed, making it challenging to directly learn a mapping between fMRI and video; (2) the temporal resolution of fMRI is significantly lower than that of natural videos. To overcome these issues, this paper proposes a two-stage model named Mind-Animator, which achieves state-of-the-art performance on three public datasets. Specifically, during the fMRI-to-feature stage, we decouple semantic, structural, and motion features from fMRI through fMRI-vision-language tri-modal contrastive learning and sparse causal attention. In the feature-to-video stage, these features are merged to videos by an inflated Stable Diffusion. We substantiate that the reconstructed video dynamics are indeed derived from fMRI, rather than hallucinations of the generative model, through permutation tests. Additionally, the visualization of voxel-wise and ROI-wise importance maps confirms the neurobiological interpretability of our model.
https://arxiv.org/abs/2405.03280
In the course of the energy transition, generation and consumption patterns will change, and many new technologies, such as PV systems, electric cars, and heat pumps, will influence the power flow, especially in the distribution grids. Scalable methods that can make decisions for each grid connection are needed to enable congestion-free grid operation in the distribution grids. This paper presents a novel end-to-end approach to resolving congestion in distribution grids with deep reinforcement learning. Our architecture learns to curtail power and set appropriate reactive power to determine a non-congested and, thus, feasible grid state. State-of-the-art methods such as the optimal power flow (OPF) incur high computational costs and require detailed measurements of every bus in a grid. In contrast, the presented method enables decisions under sparse information with only a few buses observable in the grid. Distribution grids are generally not yet fully digitized and observable, so this method can be used for decision-making on the majority of low-voltage grids. On a real low-voltage grid the approach resolves 100\% of violations in the voltage band and 98.8\% of asset overloads. The results show that decisions of sufficient quality for congestion-free grid operation can also be made on real grids.
https://arxiv.org/abs/2405.03262
Deep learning has emerged as a promising approach for learning the nonlinear mapping between diffusion-weighted MR images and tissue parameters, which enables automatic and deep understanding of the brain microstructures. However, the efficiency and accuracy in the multi-parametric estimations are still limited since previous studies tend to estimate multi-parametric maps with dense sampling and isolated signal modeling. This paper proposes DeepMpMRI, a unified framework for fast and high-fidelity multi-parametric estimation from various diffusion models using sparsely sampled q-space data. DeepMpMRI is equipped with a newly designed tensor-decomposition-based regularizer to effectively capture fine details by exploiting the correlation across parameters. In addition, we introduce a Nesterov-based adaptive learning algorithm that optimizes the regularization parameter dynamically to enhance the performance. DeepMpMRI is an extendable framework capable of incorporating flexible network architecture. Experimental results demonstrate the superiority of our approach over 5 state-of-the-art methods in simultaneously estimating multi-parametric maps for various diffusion models with fine-grained details both quantitatively and qualitatively, achieving 4.5 - 22.5$\times$ acceleration compared to the dense sampling of a total of 270 diffusion gradients.
https://arxiv.org/abs/2405.03159
Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially with sparse rewards, remains a significant challenge. The training of a DRL agent can often be trapped in a bottleneck without further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through the training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines both the default initial states and critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.
https://arxiv.org/abs/2405.03064
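The core construction, a mixed initial-state distribution over default starts and explanation-identified critical states, is simple to sketch. The mixing probability and the string-valued stand-in states below are illustrative assumptions; RICE's actual critical states come from an explanation method applied to the trained agent.

```python
import random

def make_mixed_reset(default_states, critical_states, p_critical=0.5):
    """Mixed initial-state distribution (the high-level idea of RICE): with
    probability p_critical reset to a critical state identified by an
    explanation method, otherwise to a default initial state."""
    def reset():
        pool = critical_states if random.random() < p_critical else default_states
        return random.choice(pool)
    return reset

random.seed(0)
reset = make_mixed_reset(default_states=["s0"], critical_states=["c1", "c2"])
starts = [reset() for _ in range(1000)]
frac_critical = sum(s.startswith("c") for s in starts) / 1000
print(frac_critical)   # close to 0.5
```

Training episodes launched from `reset()` then explore from both distributions, which is what the paper's tighter sub-optimality bound is argued over.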
3D object detectors for point clouds often rely on a pooling-based PointNet to encode sparse points into grid-like voxels or pillars. In this paper, we identify that the common PointNet design introduces an information bottleneck that limits 3D object detection accuracy and scalability. To address this limitation, we propose PVTransformer: a transformer-based point-to-voxel architecture for 3D detection. Our key idea is to replace the PointNet pooling operation with an attention module, leading to a better point-to-voxel aggregation function. Our design respects the permutation invariance of sparse 3D points while being more expressive than the pooling-based PointNet. Experimental results show our PVTransformer achieves much better performance compared to the latest 3D object detectors. On the widely used Waymo Open Dataset, our PVTransformer achieves state-of-the-art 76.5 mAPH L2, outperforming the prior art of SWFormer by +1.7 mAPH L2.
https://arxiv.org/abs/2405.02811
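The key swap, replacing PointNet's max-pool with an attention aggregator over the points in each voxel, can be sketched with a single learned query vector. This is a minimal single-head, single-query illustration of attention pooling, not PVTransformer's actual architecture; the feature dimensions and the query are illustrative assumptions. Note the result is still permutation-invariant over points, as the abstract requires.

```python
import numpy as np

def attention_pool(points_feat, query):
    """Single-query attention over the points in one voxel (a sketch of
    replacing PointNet's max-pool with an attention aggregator).
    points_feat: (N, D) per-point features; query: (D,) learned query."""
    scores = points_feat @ query / np.sqrt(points_feat.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax attention weights
    return w @ points_feat                 # weighted sum over points

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))       # 16 points in a voxel, 8-dim features
query = rng.standard_normal(8)
v = attention_pool(feats, query)

# Permutation invariance: shuffling the points leaves the voxel feature unchanged.
perm = rng.permutation(16)
print(np.allclose(v, attention_pool(feats[perm], query)))  # True
```

Compared to max-pooling, the weighted sum lets every point contribute, which is the expressiveness gain the abstract points to.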
The perception of 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important as they indicate the nuances in driving behavior that may be safety critical, such as behaviors near a stop sign or parking positions. We delve into this under-explored task, examining its unique challenges and developing our solution, accompanied by a carefully designed benchmark. Specifically, due to the lack of correspondences between consecutive frames of sparse Lidar point clouds, static objects might appear to be moving - the so-called swimming effect. This intertwines with the true object motion, thereby posing ambiguity in accurate estimation, especially for subtle motions. To address this, we propose to leverage local occupancy completion of object point clouds to densify the shape cue, and mitigate the impact of swimming artifacts. The occupancy completion is learned in an end-to-end fashion together with the detection of moving objects and the estimation of their motion, instantaneously as soon as objects start to move. Extensive experiments demonstrate superior performance compared to standard 3D motion estimation approaches, particularly highlighting our method's specialized treatment of subtle motions.
https://arxiv.org/abs/2405.02781
Active learning in 3D scene reconstruction has been widely studied, as selecting informative training views is critical for the reconstruction. Recently, Neural Radiance Fields (NeRF) variants have shown performance increases in active 3D reconstruction using image rendering or geometric uncertainty. However, the simultaneous consideration of both uncertainties in selecting informative views remains unexplored, while utilizing different types of uncertainty can reduce the bias that arises in the early training stage with sparse inputs. In this paper, we propose ActiveNeuS, which evaluates candidate views considering both uncertainties. ActiveNeuS provides a way to accumulate image rendering uncertainty while avoiding the bias that the estimated densities can introduce. ActiveNeuS computes the neural implicit surface uncertainty, providing the color uncertainty along with the surface information. It efficiently handles the bias by using the surface information and a grid, enabling the fast selection of diverse viewpoints. Our method outperforms previous works on popular datasets, Blender and DTU, showing that the views selected by ActiveNeuS significantly improve performance.
https://arxiv.org/abs/2405.02568
Computed Tomography (CT) is pivotal in industrial quality control and medical diagnostics. Sparse-view CT, offering reduced ionizing radiation, faces challenges due to its under-sampled nature, leading to ill-posed reconstruction problems. Recent advancements in Implicit Neural Representations (INRs) have shown promise in addressing sparse-view CT reconstruction. Recognizing that CT often involves scanning similar subjects, we propose a novel approach to improve reconstruction quality through joint reconstruction of multiple objects using INRs. This approach can potentially leverage both the strengths of INRs and the statistical regularities across multiple objects. While current INR joint reconstruction techniques primarily focus on accelerating convergence via meta-initialization, they are not specifically tailored to enhance reconstruction quality. To address this gap, we introduce a novel INR-based Bayesian framework integrating latent variables to capture the inter-object relationships. These variables serve as a dynamic reference throughout the optimization, thereby enhancing individual reconstruction fidelity. Our extensive experiments, which assess various key factors such as reconstruction quality, resistance to overfitting, and generalizability, demonstrate significant improvements over baselines in common numerical metrics. This underscores a notable advancement in CT reconstruction methods.
https://arxiv.org/abs/2405.02509
We present a novel agent-based approach to simulating an over-the-counter (OTC) financial market in which trades are intermediated solely by market makers and agent visibility is constrained to a network topology. Dynamics, such as changes in price, result from agent-level interactions that ubiquitously occur via market maker agents acting as liquidity providers. Two additional agents are considered: trend investors use a deep convolutional neural network paired with a deep Q-learning framework to inform trading decisions by analysing price history; and value investors use a static price-target to determine their trade directions and sizes. We demonstrate that our novel inclusion of a network topology with market makers facilitates explorations into various market structures. First, we present the model and an overview of its mechanics. Second, we validate our findings via comparison to the real-world: we demonstrate a fat-tailed distribution of price changes, auto-correlated volatility, a skew negatively correlated to market maker positioning, predictable price-history patterns and more. Finally, we demonstrate that our network-based model can lend insights into the effect of market-structure on price-action. For example, we show that markets with sparsely connected intermediaries can have a critical point of fragmentation, beyond which the market forms distinct clusters and arbitrage becomes rapidly possible between the prices of different market makers. A discussion is provided on future work that would be beneficial.
https://arxiv.org/abs/2405.02480