Deep learning has emerged as a promising approach for learning the nonlinear mapping between diffusion-weighted MR images and tissue parameters, enabling automatic and in-depth understanding of brain microstructure. However, the efficiency and accuracy of multi-parametric estimation remain limited, since previous studies tend to estimate multi-parametric maps with dense sampling and isolated signal modeling. This paper proposes DeepMpMRI, a unified framework for fast and high-fidelity multi-parametric estimation from various diffusion models using sparsely sampled q-space data. DeepMpMRI is equipped with a newly designed tensor-decomposition-based regularizer that effectively captures fine details by exploiting the correlation across parameters. In addition, we introduce a Nesterov-based adaptive learning algorithm that dynamically optimizes the regularization parameter to enhance performance. DeepMpMRI is an extendable framework capable of incorporating flexible network architectures. Experimental results demonstrate the superiority of our approach over 5 state-of-the-art methods in simultaneously estimating multi-parametric maps for various diffusion models with fine-grained details, both quantitatively and qualitatively, achieving a 4.5-22.5$\times$ acceleration compared to dense sampling with a total of 270 diffusion gradients.
https://arxiv.org/abs/2405.03159
Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially those with sparse rewards, remains a significant challenge. The training of a DRL agent can often become trapped in a bottleneck without further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines both the default initial states and critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.
https://arxiv.org/abs/2405.03064
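RICE's mixed initial-state distribution can be sketched as a simple reset wrapper. The function names and the mixing probability below are illustrative assumptions for exposition, not the paper's API:

```python
import random

def make_mixed_reset(default_reset, critical_states, p_critical=0.5):
    """Build a reset function that draws the initial state from a mixture
    of the environment's default initial-state distribution and a buffer
    of critical states identified by an explanation method.
    `default_reset` and `critical_states` are hypothetical stand-ins."""
    def reset():
        if critical_states and random.random() < p_critical:
            # Restart an episode from a previously identified critical state.
            return random.choice(critical_states)
        # Otherwise fall back to the default initial-state distribution.
        return default_reset()
    return reset
```

Training then proceeds with any standard RL algorithm, but episodes occasionally begin at the critical states, which is what encourages the agent to explore past the bottleneck.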
3D object detectors for point clouds often rely on a pooling-based PointNet to encode sparse points into grid-like voxels or pillars. In this paper, we identify that the common PointNet design introduces an information bottleneck that limits 3D object detection accuracy and scalability. To address this limitation, we propose PVTransformer: a transformer-based point-to-voxel architecture for 3D detection. Our key idea is to replace the PointNet pooling operation with an attention module, leading to a better point-to-voxel aggregation function. Our design respects the permutation invariance of sparse 3D points while being more expressive than the pooling-based PointNet. Experimental results show our PVTransformer achieves much better performance compared to the latest 3D object detectors. On the widely used Waymo Open Dataset, our PVTransformer achieves state-of-the-art 76.5 mAPH L2, outperforming the prior art of SWFormer by +1.7 mAPH L2.
https://arxiv.org/abs/2405.02811
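The core swap described above, pooling replaced by attention over the points in a voxel, can be sketched with single-query attention in NumPy. The weights here are random stand-ins for learned parameters, and the real PVTransformer is a full transformer module rather than this single head:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(point_feats, query, Wk, Wv):
    """Aggregate a variable number of point features into one voxel
    feature with single-query attention: a permutation-invariant but
    more expressive alternative to PointNet-style max/avg pooling.
    Shapes: point_feats (N, d), query (d,), Wk/Wv (d, d)."""
    keys = point_feats @ Wk                       # (N, d)
    values = point_feats @ Wv                     # (N, d)
    scores = keys @ query / np.sqrt(query.size)   # (N,) scaled dot products
    w = softmax(scores)                           # attention weights over points
    return w @ values                             # (d,) pooled voxel feature
```

Because the attention weights are computed per point and summed, reordering the points leaves the output unchanged, preserving the permutation invariance that sparse 3D encoders require.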
The perception of the 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important, as they indicate nuances in driving behavior that may be safety-critical, such as behaviors near a stop sign or parking positions. We delve into this under-explored task, examining its unique challenges and developing our solution, accompanied by a carefully designed benchmark. Specifically, due to the lack of correspondences between consecutive frames of sparse Lidar point clouds, static objects might appear to be moving - the so-called swimming effect. This intertwines with the true object motion, thereby introducing ambiguity into accurate estimation, especially for subtle motions. To address this, we propose to leverage local occupancy completion of object point clouds to densify the shape cue and mitigate the impact of swimming artifacts. The occupancy completion is learned in an end-to-end fashion together with the detection of moving objects and the estimation of their motion, instantaneously as soon as objects start to move. Extensive experiments demonstrate superior performance compared to standard 3D motion estimation approaches, particularly highlighting our method's specialized treatment of subtle motions.
https://arxiv.org/abs/2405.02781
Active learning in 3D scene reconstruction has been widely studied, as selecting informative training views is critical for the reconstruction. Recently, Neural Radiance Fields (NeRF) variants have shown performance increases in active 3D reconstruction using image rendering or geometric uncertainty. However, the simultaneous consideration of both uncertainties in selecting informative views remains unexplored, while utilizing different types of uncertainty can reduce the bias that arises in the early training stage with sparse inputs. In this paper, we propose ActiveNeuS, which evaluates candidate views considering both uncertainties. ActiveNeuS provides a way to accumulate image rendering uncertainty while avoiding the bias that the estimated densities can introduce. ActiveNeuS computes the neural implicit surface uncertainty, providing the color uncertainty along with the surface information. It efficiently handles the bias by using the surface information and a grid, enabling the fast selection of diverse viewpoints. Our method outperforms previous works on popular datasets, Blender and DTU, showing that the views selected by ActiveNeuS significantly improve performance.
https://arxiv.org/abs/2405.02568
Computed Tomography (CT) is pivotal in industrial quality control and medical diagnostics. Sparse-view CT, offering reduced ionizing radiation, faces challenges due to its under-sampled nature, leading to ill-posed reconstruction problems. Recent advancements in Implicit Neural Representations (INRs) have shown promise in addressing sparse-view CT reconstruction. Recognizing that CT often involves scanning similar subjects, we propose a novel approach to improve reconstruction quality through joint reconstruction of multiple objects using INRs. This approach can potentially leverage both the strengths of INRs and the statistical regularities across multiple objects. While current INR joint reconstruction techniques primarily focus on accelerating convergence via meta-initialization, they are not specifically tailored to enhance reconstruction quality. To address this gap, we introduce a novel INR-based Bayesian framework integrating latent variables to capture the inter-object relationships. These variables serve as a dynamic reference throughout the optimization, thereby enhancing individual reconstruction fidelity. Our extensive experiments, which assess various key factors such as reconstruction quality, resistance to overfitting, and generalizability, demonstrate significant improvements over baselines in common numerical metrics. This underscores a notable advancement in CT reconstruction methods.
https://arxiv.org/abs/2405.02509
We present a novel agent-based approach to simulating an over-the-counter (OTC) financial market in which trades are intermediated solely by market makers and agent visibility is constrained to a network topology. Dynamics, such as changes in price, result from agent-level interactions that ubiquitously occur via market maker agents acting as liquidity providers. Two additional agents are considered: trend investors use a deep convolutional neural network paired with a deep Q-learning framework to inform trading decisions by analysing price history; and value investors use a static price-target to determine their trade directions and sizes. We demonstrate that our novel inclusion of a network topology with market makers facilitates explorations into various market structures. First, we present the model and an overview of its mechanics. Second, we validate our findings via comparison to the real-world: we demonstrate a fat-tailed distribution of price changes, auto-correlated volatility, a skew negatively correlated to market maker positioning, predictable price-history patterns and more. Finally, we demonstrate that our network-based model can lend insights into the effect of market-structure on price-action. For example, we show that markets with sparsely connected intermediaries can have a critical point of fragmentation, beyond which the market forms distinct clusters and arbitrage becomes rapidly possible between the prices of different market makers. A discussion is provided on future work that would be beneficial.
https://arxiv.org/abs/2405.02480
The segmentation of individual trees from forest point clouds is a crucial task for downstream analyses such as carbon sequestration estimation. Recently, deep-learning-based methods have been proposed which show the potential of learning to segment trees. Since these methods are trained in a supervised way, the question arises how general models can be obtained that are applicable across a wide range of settings. So far, training has been mainly conducted with data from one specific laser scanning type and for specific types of forests. In this work, we train one segmentation model under various conditions, using seven diverse datasets found in literature, to gain insights into the generalization capabilities under domain-shift. Our results suggest that a generalization from coniferous dominated sparse point clouds to deciduous dominated high-resolution point clouds is possible. Conversely, qualitative evidence suggests that generalization from high-resolution to low-resolution point clouds is challenging. This emphasizes the need for forest point clouds with diverse data characteristics for model development. To enrich the available data basis, labeled trees from two previous works were propagated to the complete forest point cloud and are made publicly available at this https URL.
https://arxiv.org/abs/2405.02061
This paper introduces the Sparse Tsetlin Machine (STM), a novel Tsetlin Machine (TM) that processes sparse data efficiently. Traditionally, the TM does not consider data characteristics such as sparsity, commonly seen in NLP applications and other bag-of-words-based representations. Consequently, a TM must initialize, store, and process a significant number of zero values, resulting in excessive memory usage and computational time. Previous attempts at creating a sparse TM have predominantly been unsuccessful, primarily due to their inability to identify which literals are sufficient for TM training. By introducing Active Literals (AL), the STM can focus exclusively on literals that actively contribute to the current data representation, significantly decreasing memory footprint and computational time while demonstrating competitive classification performance.
https://arxiv.org/abs/2405.02375
The increasing demand for underwater vehicles highlights the necessity for robust localization solutions in inspection missions. In this work, we present a novel real-time sonar-based underwater global positioning algorithm for AUVs (Autonomous Underwater Vehicles) designed for environments with a sparse distribution of human-made assets. Our approach exploits two synergistic data interpretation frontends applied to the same stream of sonar data acquired by a multibeam Forward-Looking Sonar (FLS). These observations are fused within a Particle Filter (PF), either to assign greater weight to particles that belong to high-likelihood regions or to resolve symmetric ambiguities. Preliminary experiments carried out in a simulated environment resembling a real underwater plant provided promising results. This work represents a starting point towards future developments of the method and consequent exhaustive evaluations, also in real-world scenarios.
https://arxiv.org/abs/2405.01971
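A single measurement update of the kind described, where a frontend supplies per-particle observation likelihoods that reweight the particle set, can be sketched generically (the paper's fusion details and resampling scheme may well differ; the likelihoods here are hypothetical inputs):

```python
import random

def pf_update(particles, weights, likelihoods):
    """One measurement update of a particle filter: reweight each
    particle by the observation likelihood reported by a frontend,
    normalize, and resample (plain multinomial resampling for brevity)."""
    w = [wi * li for wi, li in zip(weights, likelihoods)]
    total = sum(w)
    w = [wi / total for wi in w]
    # Resample with replacement proportionally to the new weights,
    # then reset to uniform weights as is standard after resampling.
    resampled = random.choices(particles, weights=w, k=len(particles))
    uniform = [1.0 / len(particles)] * len(particles)
    return resampled, uniform
```

Particles consistent with the sonar observation survive resampling, which is how high-likelihood regions accumulate particles and symmetric hypotheses are eventually pruned.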
The rapid advancement in Large Language Models (LLMs) has markedly enhanced the capabilities of language understanding and generation. However, the substantial model size poses hardware challenges, affecting both memory size for serving and inference latency for token generation. To address those challenges, we propose Dependency-aware Semi-structured Sparsity (DaSS), a novel method for the recent prevalent SwiGLU-based LLMs pruning. Our approach incorporates structural dependency into the weight magnitude-based unstructured pruning. We introduce an MLP-specific pruning metric that evaluates the importance of each weight by jointly considering its magnitude and its corresponding MLP intermediate activation norms. DaSS facilitates a balance between the adaptability offered by unstructured pruning and the structural consistency inherent in dependency-based structured pruning. Empirical evaluations on Mistral and LLaMA2 model families demonstrate that DaSS not only outperforms both SparseGPT and Wanda in achieving hardware-friendly N:M sparsity patterns but also maintains the computational efficiency of Wanda.
https://arxiv.org/abs/2405.01943
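The pruning metric above, weight magnitude jointly scored with the corresponding intermediate activation norm, then thresholded into an N:M pattern, can be sketched in NumPy. This is a simplified sketch in the spirit of Wanda-style scoring; the exact DaSS formulation of which activation norms pair with which projection weights may differ:

```python
import numpy as np

def nm_prune_mask(weight, act_norm, n=2, m=4):
    """Score each weight by |w| times the activation norm of its input
    channel, then within every group of m consecutive input weights keep
    only the n highest-scoring ones (an N:M semi-structured pattern).
    weight: (out, in); act_norm: (in,); in must be divisible by m."""
    score = np.abs(weight) * act_norm[None, :]       # (out, in) importance
    out_dim, in_dim = weight.shape
    groups = score.reshape(out_dim, in_dim // m, m)  # group along input dim
    order = np.argsort(groups, axis=-1)              # rank within each group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., -n:], True, axis=-1)  # keep top-n of m
    return mask.reshape(out_dim, in_dim)
```

The resulting boolean mask is multiplied into the weight matrix; the 2:4 pattern in particular maps onto sparse tensor-core hardware support.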
Healthcare monitoring is crucial, especially for the daily care of elderly individuals living alone. It can detect dangerous occurrences, such as falls, and provide timely alerts to save lives. Non-invasive millimeter wave (mmWave) radar-based healthcare monitoring systems using advanced human activity recognition (HAR) models have recently gained significant attention. However, they encounter challenges in handling sparse point clouds, achieving real-time continuous classification, and coping with limited monitoring ranges when statically mounted. To overcome these limitations, we propose RobHAR, a movable robot-mounted mmWave radar system with lightweight deep neural networks for real-time monitoring of human activities. Specifically, we first propose a sparse point cloud-based global embedding to learn the features of point clouds using the light-PointNet (LPN) backbone. Then, we learn the temporal pattern with a bidirectional lightweight LSTM model (BiLiLSTM). In addition, we implement a transition optimization strategy, integrating the Hidden Markov Model (HMM) with Connectionist Temporal Classification (CTC) to improve the accuracy and robustness of the continuous HAR. Our experiments on three datasets indicate that our method significantly outperforms the previous studies in both discrete and continuous HAR tasks. Finally, we deploy our system on a movable robot-mounted edge computing platform, achieving flexible healthcare monitoring in real-world scenarios.
https://arxiv.org/abs/2405.01882
The advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on these reasoning capabilities for improved planning behavior is challenging, since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset that challenges the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
https://arxiv.org/abs/2405.01533
Computer-aided segmentation methods can assist medical personnel in improving diagnostic outcomes. While recent advancements like UNet and its variants have shown promise, they face a critical challenge: balancing accuracy with computational efficiency. Shallow encoder architectures in UNets often struggle to capture crucial spatial features, leading to inaccurate and sparse segmentation. To address this limitation, we propose a novel \underline{P}rogressive \underline{A}ttention based \underline{M}obile \underline{UNet} (\underline{PAM-UNet}) architecture. The inverted residual (IR) blocks in PAM-UNet help maintain a lightweight framework, while layerwise \textit{Progressive Luong Attention} ($\mathcal{PLA}$) promotes precise segmentation by directing attention toward regions of interest during synthesis. Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a Dice score of 82.87, while requiring only 1.32 floating-point operations per second (FLOPS) on the Liver Tumor Segmentation Benchmark (LiTS) 2017 dataset. These results highlight the importance of developing efficient segmentation models to accelerate the adoption of AI in clinical practice.
https://arxiv.org/abs/2405.01503
Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate directly on a real world recorded hand-object dataset with unseen objects. We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.
https://arxiv.org/abs/2405.01353
Edge computing allows artificial intelligence and machine learning models to be deployed on edge devices, where they can learn from local data and collaborate to form a global model. Federated learning (FL) is a distributed machine learning technique that facilitates this process while preserving data privacy. However, FL also faces challenges such as high computational and communication costs on resource-constrained devices, and poor generalization performance due to the heterogeneity of data across edge clients and the presence of out-of-distribution data. In this paper, we propose Gradient-Congruity Guided Federated Sparse Training (FedSGC), a novel method that integrates dynamic sparse training and gradient congruity inspection into the federated learning framework to address these issues. Our method leverages the idea that neurons whose associated gradients conflict in direction with the global model carry irrelevant or less generalized information for other clients and can be pruned during the sparse training process. Conversely, neurons whose associated gradients are consistent in direction can be grown with higher priority. In this way, FedSGC can greatly reduce local computation and communication overheads while, at the same time, enhancing the generalization ability of FL. We evaluate our method in challenging non-i.i.d. settings and show that it achieves competitive accuracy with state-of-the-art FL methods across various scenarios while minimizing computation and communication costs.
https://arxiv.org/abs/2405.01189
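The congruity-guided grow-and-prune step can be sketched at the per-parameter level: parameters whose local gradient agrees in sign with the aggregated global gradient score positively (candidates for growth), while conflicting ones score negatively (candidates for pruning). This is a simplified sketch; the paper operates at the neuron level and its exact criterion may differ:

```python
import numpy as np

def congruity_scores(local_grad, global_grad):
    """Sign agreement between local and global gradients, magnitude-weighted:
    positive where directions are congruent, negative where they conflict."""
    return np.sign(local_grad) * np.sign(global_grad) * np.abs(global_grad)

def grow_and_prune(mask, scores, k):
    """Dynamic sparse training step at fixed sparsity: prune the k active
    weights with the lowest congruity scores and grow the k inactive
    weights with the highest scores."""
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(~mask)
    prune = active[np.argsort(scores[active])[:k]]
    grow = inactive[np.argsort(scores[inactive])[-k:]]
    new_mask = mask.copy()
    new_mask[prune] = False
    new_mask[grow] = True
    return new_mask
```

Because the sparsity level is unchanged after each step, the communication payload per round stays constant while the mask drifts toward directionally congruent, better-generalizing weights.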
Hyperspectral Imaging (HSI) serves as an important technique in remote sensing. However, high dimensionality and data volume typically pose significant computational challenges. Band selection is essential for reducing spectral redundancy in hyperspectral imagery while retaining intrinsic critical information. In this work, we propose a novel hyperspectral band selection model by decomposing the data into a low-rank and smooth component and a sparse one. In particular, we develop a generalized 3D total variation (G3DTV) by applying the $\ell_1^p$-norm to derivatives to preserve spatial-spectral smoothness. By employing the alternating direction method of multipliers (ADMM), we derive an efficient algorithm, where the tensor low-rankness is implied by the tensor CUR decomposition. We demonstrate the effectiveness of the proposed approach through comparisons with various other state-of-the-art band selection techniques using two benchmark real-world datasets. In addition, we provide practical guidelines for parameter selection in both noise-free and noisy scenarios.
https://arxiv.org/abs/2405.00951
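A plausible form of the decomposition model described above, written as an optimization problem. This is a hedged reconstruction from the abstract's description alone: the exact objective, constraint handling, and definition of the $\ell_1^p$ derivative term in the paper may differ.

```latex
% Split the hyperspectral tensor X into a low-rank, smooth part L and a sparse part S:
\min_{\mathcal{L},\,\mathcal{S}}\;
  \lambda_{1}\,\mathrm{G3DTV}(\mathcal{L}) \;+\; \lambda_{2}\,\|\mathcal{S}\|_{1}
\quad \text{s.t.}\quad
  \mathcal{X} = \mathcal{L} + \mathcal{S},\;\;
  \mathcal{L}\ \text{low-rank (enforced via the tensor CUR decomposition)},
% with the generalized 3D total variation applying the l_1^p norm to derivatives:
\qquad
\mathrm{G3DTV}(\mathcal{L}) \;=\; \sum_{d \in \{x,\,y,\,z\}} \big\|\nabla_{d}\,\mathcal{L}\big\|_{\ell_1^p}.
```

Under such a splitting, ADMM alternates between a TV-regularized, CUR-constrained update of $\mathcal{L}$ and a soft-thresholding update of $\mathcal{S}$, and bands are then ranked using the recovered components.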
Multi-agent reinforcement learning (MARL) algorithms often struggle to find strategies close to Pareto optimal Nash Equilibrium, owing largely to the lack of efficient exploration. The problem is exacerbated in sparse-reward settings, caused by the larger variance exhibited in policy learning. This paper introduces MESA, a novel meta-exploration method for cooperative multi-agent learning. It learns to explore by first identifying the agents' high-rewarding joint state-action subspace from training tasks and then learning a set of diverse exploration policies to "cover" the subspace. These trained exploration policies can be integrated with any off-policy MARL algorithm for test-time tasks. We first showcase MESA's advantage in a multi-step matrix game. Furthermore, experiments show that with learned exploration policies, MESA achieves significantly better performance in sparse-reward tasks in several multi-agent particle environments and multi-agent MuJoCo environments, and exhibits the ability to generalize to more challenging tasks at test time.
https://arxiv.org/abs/2405.00902
Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow a better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which is fused with the implicit grid-based representation for radiance decoding, thereby supplying stronger geometric information offered by explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing densified Lidar points by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.
https://arxiv.org/abs/2405.00900
Neural Radiance Fields (NeRF) have shown impressive results in 3D reconstruction and generating novel views. A key challenge within NeRF is the editing of reconstructed scenes, such as object removal, which requires maintaining consistency across multiple views and ensuring high-quality synthesised perspectives. Previous studies have incorporated depth priors, typically from LiDAR or sparse depth measurements provided by COLMAP, to improve the performance of object removal in NeRF. However, these methods are either costly or time-consuming. In this paper, we propose a novel approach that integrates monocular depth estimates with NeRF-based object removal models to significantly reduce time consumption and enhance the robustness and quality of scene generation and object removal. We conducted a thorough evaluation of COLMAP's dense depth reconstruction on the KITTI dataset to verify its accuracy in depth map generation. Our findings suggest that COLMAP can serve as an effective alternative to a ground truth depth map where such information is missing or costly to obtain. Additionally, we integrated various monocular depth estimation methods into the removal NeRF model, i.e., SpinNeRF, to assess their capacity to improve object removal performance. Our experimental results highlight the potential of monocular depth estimation to substantially improve NeRF applications.
https://arxiv.org/abs/2405.00630