Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and a 9.4% increase in the overall score over baseline models on the DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at this https URL.
https://arxiv.org/abs/2503.14498
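As a rough illustration of the kind of spatiotemporal cue a track encoder can add, the sketch below summarizes a 3D track into a fixed-length feature and concatenates it onto a visual query. The hand-crafted features and fusion-by-concatenation are illustrative assumptions, not the paper's architecture.

```python
# Toy track encoder: summarize a 3D track (object centers over time) into
# a fixed-length feature that can be fused with a visual query vector.

def encode_track(track):
    """track: list of (x, y, z) centers ordered in time."""
    n = len(track)
    # mean position: a coarse 3D spatial cue
    cx = sum(p[0] for p in track) / n
    cy = sum(p[1] for p in track) / n
    cz = sum(p[2] for p in track) / n
    # average velocity: a temporal cue (zero for a single observation)
    if n > 1:
        vx = (track[-1][0] - track[0][0]) / (n - 1)
        vy = (track[-1][1] - track[0][1]) / (n - 1)
        vz = (track[-1][2] - track[0][2]) / (n - 1)
    else:
        vx = vy = vz = 0.0
    return [cx, cy, cz, vx, vy, vz]

def fuse(visual_query, track_feature):
    """Enrich a visual query by concatenating the track feature."""
    return list(visual_query) + list(track_feature)

track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
feat = encode_track(track)      # mean position plus average velocity
query = fuse([0.5, 0.5], feat)  # 2 + 6 = 8-dimensional enriched query
```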
In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have developed rapidly, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, which has been applied to facilitate multi-view reconstruction tasks in an indirect manner. Due to the ambiguity of the monocular depth estimation task, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth predictions and enabling their direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes, including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at this https URL.
https://arxiv.org/abs/2503.14483
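One simple way SfM information can anchor up-to-scale monocular depth is a global scale-and-shift fit against sparse SfM depths. The least-squares sketch below illustrates that alignment step under the assumption of a linear depth relation; the paper's actual integration into the depth estimation process is more involved.

```python
# Fit scale s and shift b so that s*d_i + b matches sparse SfM depths z_i
# in the least-squares sense, anchoring up-to-scale monocular depth.

def fit_scale_shift(d, z):
    """Least-squares fit of z ~ s*d + b for paired depth samples."""
    n = len(d)
    md = sum(d) / n
    mz = sum(z) / n
    cov = sum((di - md) * (zi - mz) for di, zi in zip(d, z))
    var = sum((di - md) ** 2 for di in d)
    s = cov / var
    b = mz - s * md
    return s, b

# monocular depths are up-to-scale; SfM depths carry the metric scale
mono = [1.0, 2.0, 3.0]
sfm = [2.5, 4.5, 6.5]   # equals 2*mono + 0.5
s, b = fit_scale_shift(mono, sfm)
aligned = [s * di + b for di in mono]
```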
To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine-tuning because they require constructing examples that reflect the agent's specific capabilities. We therefore propose a radically new approach to teaching agents what they know: \emph{collaborative self-play}. We construct multi-agent collaborations in which the group is rewarded for collectively arriving at correct answers. The desired meta-knowledge emerges from the incentives built into the structure of the interaction. We focus on small societies of agents that have access to heterogeneous tools (corpus-specific retrieval), and therefore must collaborate to maximize their success while minimizing their effort. Experiments show that group-level rewards for multi-agent communities can induce policies that \emph{transfer} to improve tool use and selective prediction in settings where individual agents are deployed in isolation.
https://arxiv.org/abs/2503.14481
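The incentive structure described above can be caricatured in a few lines: the group earns a reward only for a correct collective answer, minus an effort cost for tool calls, so answering beyond one's competence or spending tools needlessly is discouraged. The majority-vote aggregation and cost values are illustrative assumptions, not the paper's setup.

```python
# Toy group-level reward: correct collective answers pay off, effort costs.

def group_reward(answers, gold, tool_calls, tool_cost=0.1):
    """answers: list of agent answers (None = abstain)."""
    voted = [a for a in answers if a is not None]
    # majority vote among non-abstaining agents decides the group answer
    correct = bool(voted) and max(set(voted), key=voted.count) == gold
    return (1.0 if correct else 0.0) - tool_cost * tool_calls

# one capable agent answers, two abstain, one tool call was made
r = group_reward(["Paris", None, None], gold="Paris", tool_calls=1)
```

Under such a reward, an agent that abstains when it would be wrong raises the group's expected payoff, which is exactly the meta-knowledge the paper aims to induce.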
Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.
https://arxiv.org/abs/2503.14434
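The propose-score-select loop at the heart of LLM-FE can be sketched schematically; here a deterministic stand-in replaces the LLM proposer and a toy scoring function replaces model validation, both assumptions for illustration.

```python
# Schematic of the LLM-FE loop: iteratively propose feature transformation
# programs and keep those that improve a data-driven validation score.

def evolve_features(proposer, score, rounds):
    best_prog, best_score = None, float("-inf")
    for _ in range(rounds):
        prog = proposer(best_prog)   # an LLM would propose a program here
        s = score(prog)              # data-driven feedback on the feature
        if s > best_score:
            best_prog, best_score = prog, s
    return best_prog, best_score

# toy stand-ins: "programs" are exponents, the score peaks at 2
pool = iter([1, 3, 4, 2, 1])
prog, s = evolve_features(lambda cur: next(pool), lambda p: -abs(p - 2), 5)
```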
Multi-map sparse monocular visual Simultaneous Localization and Mapping (SLAM) applied to monocular endoscopic sequences has proven efficient at robustly recovering tracking after the frequent losses in endoscopy due to motion blur, temporal occlusion, tool interaction or water jets. The sparse multi-maps are adequate for robust camera localization, but they are very poor for environment representation: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, their density is unacceptably low for clinical applications. We propose a method to remove outliers and densify the maps of the state-of-the-art sparse endoscopy multi-map CudaSIFT-SLAM. Dense up-to-scale depth predictions from the LightDepth neural network are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps (4.15 mm RMS accuracy) at affordable computing time on the C3VD phantom colon dataset, and we report qualitative results on real colonoscopies from the Endomapper dataset.
https://arxiv.org/abs/2503.14346
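Least Median of Squares (LMedS), the robust estimator mentioned above, can be illustrated with a one-parameter scale fit: among candidate scales, keep the one minimizing the median squared residual, so a gross outlier cannot sway the estimate. The candidate-generation scheme below is a simplification of the real alignment problem.

```python
# LMedS scale estimate for z_i ~ s*d_i: minimize the MEDIAN squared
# residual, tolerating up to ~50% outliers (unlike least squares).

def lmeds_scale(dense, sparse):
    candidates = [zi / di for di, zi in zip(dense, sparse) if di != 0]

    def median_sq_residual(s):
        r = sorted((zi - s * di) ** 2 for di, zi in zip(dense, sparse))
        n = len(r)
        return r[n // 2] if n % 2 else 0.5 * (r[n // 2 - 1] + r[n // 2])

    return min(candidates, key=median_sq_residual)

dense = [1.0, 2.0, 3.0, 4.0, 5.0]
sparse = [3.0, 6.0, 9.0, 12.0, 100.0]   # last point is a gross outlier
s = lmeds_scale(dense, sparse)          # recovers scale 3 despite outlier
```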
Predicting the words that a child is going to learn next can be useful for boosting language acquisition, and such predictions have been shown to be possible with both neural network techniques (looking at changes in the vocabulary state over time) and graph models (looking at data pertaining to the relationships between words). However, these models do not fully capture the complexity of an infant's language learning process when used in isolation. In this paper, we examine how a model of language acquisition for infants and young children can be constructed and adapted for use in a Spatio-Temporal Graph Convolutional Network (STGCN), taking into account the different types of linguistic relationships that occur during child language learning. We introduce a novel approach for predicting child vocabulary acquisition, and evaluate the efficacy of such a model with respect to the different types of linguistic relationships that occur during language acquisition, resulting in insightful observations on model calibration and norm selection. An evaluation of this model found that the mean accuracy for predicting new words using sensorimotor relationships (0.733) and semantic relationships (0.729) was superior to that observed with a two-layer feed-forward neural network. Furthermore, the high recall for some relationships suggested that certain relationships (e.g. visual) identify a larger proportion of the relevant words that a child should subsequently learn than others (such as auditory).
https://arxiv.org/abs/2503.14341
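The graph intuition behind relationship-based word prediction can be sketched with a toy scoring rule: words linked to many already-known words are ranked as likely next acquisitions. The tiny graph and counting rule are illustrative, not the STGCN model itself.

```python
# Rank unknown words by how many known words they are related to.

def rank_candidates(edges, known):
    """edges: (word_a, word_b) relationship pairs; known: learned words."""
    scores = {}
    for a, b in edges:
        if a in known and b not in known:
            scores[b] = scores.get(b, 0) + 1
        if b in known and a not in known:
            scores[a] = scores.get(a, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

# toy relationship graph; the child already knows "dog" and "cat"
edges = [("dog", "cat"), ("dog", "ball"), ("cat", "ball"), ("dog", "car")]
ranked = rank_candidates(edges, known={"dog", "cat"})
```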
Machine learning interatomic potentials (MLIPs) are a promising tool to accelerate atomistic simulations and molecular property prediction. The quality of MLIPs strongly depends on the quantity of available training data as well as the quantum chemistry (QC) level of theory used to generate that data. Datasets generated with high-fidelity QC methods, such as coupled cluster, are typically restricted to small molecules and may be missing energy gradients. With this limited quantity of data, it is often difficult to train good MLIP models. We present an ensemble knowledge distillation (EKD) method to improve MLIP accuracy when trained on energy-only datasets. In our EKD approach, multiple teacher models are first trained on QC energies and then used to generate atomic forces for all configurations in the dataset. Next, a student MLIP is trained on both the QC energies and the ensemble-averaged forces generated by the teacher models. We apply this workflow to the ANI-1ccx dataset, which consists of organic molecules with configuration energies computed at the coupled cluster level of theory. The resulting student MLIPs achieve new state-of-the-art accuracy on the out-of-sample COMP6 benchmark and improved stability in molecular dynamics simulations. The EKD approach for MLIPs is broadly applicable to chemical, biomolecular and materials science simulations.
https://arxiv.org/abs/2503.14293
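The data-generation step of EKD reduces to averaging per-atom force predictions over the teacher ensemble. A minimal sketch with toy teacher callables (real teachers would be trained MLIPs):

```python
# Average per-atom 3D force predictions over an ensemble of teachers.

def ensemble_forces(teachers, configuration):
    preds = [t(configuration) for t in teachers]
    n_teachers = len(preds)
    n_atoms = len(preds[0])
    return [
        [sum(p[a][k] for p in preds) / n_teachers for k in range(3)]
        for a in range(n_atoms)
    ]

# two toy teachers disagreeing on a single atom's force vector
t1 = lambda conf: [[1.0, 0.0, 0.0]]
t2 = lambda conf: [[3.0, 0.0, 0.0]]
avg = ensemble_forces([t1, t2], configuration=None)
```

The student is then trained jointly on QC energies and these averaged force labels, which smooths out individual teacher errors.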
In sawmills, it is essential to accurately measure the raw material, i.e. wooden logs, to optimise the sawing process. Earlier studies have shown that accurate predictions of the inner structure of the logs can be obtained using just surface point clouds produced by a laser scanner. This provides a cost-efficient and fast alternative to X-ray CT-based measurement devices. An essential step in analysing log point clouds is segmentation, as it forms the basis for finding the fine surface details that provide cues about the inner structure of the log. We propose a novel Point Transformer-based point cloud segmentation technique that learns to find the points belonging to the log surface in an unsupervised manner. This is achieved using a loss function that utilises the geometrical properties of a cylinder while taking into account the shape variation common in timber logs. We demonstrate the accuracy of the method on wooden logs, but the approach could also be utilised on other cylindrical objects.
https://arxiv.org/abs/2503.14244
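The geometric cue behind such an unsupervised loss can be shown with the cylinder residual: a surface point's distance to the axis should equal the radius. This sketch assumes an upright z-axis cylinder and ignores axis pose and timber shape variation, which the paper's loss accounts for.

```python
# Per-point cylinder residual: |distance from z-axis - radius|.
import math

def cylinder_residual(point, radius):
    """point = (x, y, z); zero residual means the point lies on the surface."""
    x, y, _ = point
    return abs(math.hypot(x, y) - radius)

on_surface = cylinder_residual((3.0, 4.0, 1.0), radius=5.0)
off_surface = cylinder_residual((6.0, 8.0, 1.0), radius=5.0)
```

Summing such residuals over points assigned to the surface gives a loss that needs no manual labels, only the cylindrical shape prior.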
Diabetic retinopathy is a leading cause of blindness in diabetic patients, and early detection plays a crucial role in preventing vision loss. Traditional diagnostic methods are often time-consuming and prone to errors. The emergence of deep learning techniques has provided innovative solutions to improve diagnostic efficiency. However, single deep learning models frequently face issues in extracting key features from complex retinal images. To handle this problem, we present an effective ensemble method for DR diagnosis comprising four main phases: image pre-processing, selection of backbone pre-trained models, feature enhancement, and optimization. Our methodology begins with the pre-processing phase, where we apply CLAHE to enhance image contrast and Gamma correction to adjust brightness for better feature recognition. We then apply the Discrete Wavelet Transform (DWT) for image fusion, combining multi-resolution details to create a richer dataset. Next, we select the three best-performing pre-trained models, DenseNet169, MobileNetV1, and Xception, for diverse feature extraction. To further improve feature extraction, an improved residual block is integrated into each model. Finally, the predictions from these base models are aggregated using a weighted ensemble approach, with the weights optimized by the Salp Swarm Algorithm (SSA). SSA intelligently explores the weight space and finds the optimal configuration of base architectures to maximize the performance of the ensemble model. The proposed model is evaluated on the multiclass Kaggle APTOS 2019 dataset and obtains 88.52% accuracy.
https://arxiv.org/abs/2503.14209
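The final aggregation step can be sketched as a weighted soft vote over base-model probabilities, with the weights chosen to maximize validation accuracy. For brevity a seeded random search stands in for the Salp Swarm Algorithm here; the probabilities and labels are toy data.

```python
# Weighted ensemble of class probabilities with searched weights.
import random

def weighted_vote(probs, weights):
    """probs: per-model class-probability lists; returns the argmax class."""
    n_classes = len(probs[0])
    mixed = [sum(w * p[c] for w, p in zip(weights, probs))
             for c in range(n_classes)]
    return mixed.index(max(mixed))

def search_weights(val_probs, val_labels, n_models, iters=200, seed=1):
    rng = random.Random(seed)
    best_w, best_acc = [1.0] * n_models, -1.0
    for _ in range(iters):
        w = [rng.random() for _ in range(n_models)]  # SSA would go here
        acc = sum(weighted_vote(p, w) == y
                  for p, y in zip(val_probs, val_labels)) / len(val_labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# model 0 is reliable, model 1 is anti-correlated with the truth
val_probs = [
    [[0.9, 0.1], [0.2, 0.8]],   # sample A, true class 0
    [[0.1, 0.9], [0.8, 0.2]],   # sample B, true class 1
]
val_labels = [0, 1]
w, acc = search_weights(val_probs, val_labels, n_models=2)
```

The search learns to down-weight the unreliable model, which is the same role SSA plays over ten base models in the paper.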
Trajectory prediction facilitates effective planning and decision-making, while constrained trajectory prediction integrates regulation into prediction. Recent advances in constrained trajectory prediction focus on structured constraints by constructing optimization objectives. However, handling unstructured constraints is challenging due to the lack of differentiable formal definitions. To address this, we propose a novel method for constrained trajectory prediction using a conditional generative paradigm, named Controllable Trajectory Diffusion (CTD). The key idea is that any trajectory corresponds to a degree of conformity to a constraint. By quantifying this degree and treating it as a condition, a model can implicitly learn to predict trajectories under unstructured constraints. CTD employs a pre-trained scoring model to predict the degree of conformity (i.e., a score), and uses this score as a condition for a conditional diffusion model to generate trajectories. Experimental results demonstrate that CTD achieves high accuracy on the ETH/UCY and SDD benchmarks. Qualitative analysis confirms that CTD ensures adherence to unstructured constraints and can predict trajectories that satisfy combinatorial constraints.
https://arxiv.org/abs/2503.14203
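CTD's key quantity, the degree of conformity of a trajectory to a constraint, can be made concrete with a toy region constraint: the score is the fraction of waypoints inside an allowed box. The region-based constraint is an illustrative choice, not the paper's pre-trained scoring model.

```python
# Quantify how well a trajectory conforms to a region constraint.

def conformity(trajectory, region):
    """trajectory: (x, y) waypoints; region: (xmin, xmax, ymin, ymax)."""
    xmin, xmax, ymin, ymax = region
    inside = sum(1 for x, y in trajectory
                 if xmin <= x <= xmax and ymin <= y <= ymax)
    return inside / len(trajectory)

traj = [(0.5, 0.5), (1.5, 0.5), (3.0, 0.5), (4.0, 0.5)]
score = conformity(traj, region=(0.0, 2.0, 0.0, 1.0))
```

Conditioning a generative model on such a score lets it learn, implicitly, what "fully conforming" trajectories look like, even when the constraint has no differentiable formal definition.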
Agile trajectory planning can improve the efficiency of multi-rotor Uncrewed Aerial Vehicles (UAVs) in scenarios that combine task-oriented and kinematic trajectory planning, such as monitoring spatio-temporal phenomena or intercepting dynamic targets. Agile planning using existing non-linear model predictive control methods is limited by the number of planning steps, as it becomes increasingly computationally demanding. This reduces the prediction horizon length, leading to a decrease in solution quality. In addition, a fixed time-step length limits the utilization of the available UAV dynamics in the target neighborhood. In this paper, we propose to address these limitations by introducing variable time steps and coupling them with the prediction horizon length. A simplified point-mass motion primitive is used to leverage the differential flatness of quadrotor dynamics and generate feasible trajectories in the flat output space. Based on the presented evaluation results and an experimentally validated deployment, the proposed method increases solution quality by enabling planning for long flight segments while still allowing tightly sampled maneuvering.
https://arxiv.org/abs/2503.14184
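The coupling of variable time steps to the horizon can be illustrated with a geometric step schedule: short early steps permit tight maneuvering near the target, while later, longer steps extend the covered horizon with the same number of decision variables. The growth schedule below is an assumption for illustration, not the paper's formulation.

```python
# Geometric time-step schedule: dt_k = dt0 * growth**k.

def time_steps(n_steps, dt0, growth):
    return [dt0 * growth ** k for k in range(n_steps)]

dts = time_steps(n_steps=4, dt0=0.1, growth=2.0)  # 0.1, 0.2, 0.4, 0.8
horizon = sum(dts)                                # ~1.5 s from 4 steps
```

With a fixed dt of 0.1 s, covering the same 1.5 s horizon would require 15 steps instead of 4, which is exactly the computational pressure the variable-step formulation relieves.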
End-to-end autonomous driving unifies tasks in a differentiable framework, enabling planning-oriented optimization and attracting growing attention. Current methods aggregate historical information either through dense historical bird's-eye-view (BEV) features or by querying a sparse memory bank, following paradigms inherited from detection. However, we argue that these paradigms either omit historical information in motion planning or fail to align with its multi-step nature, which requires predicting or planning multiple future time steps. In line with the philosophy that the future is a continuation of the past, we propose BridgeAD, which reformulates motion and planning queries as multi-step queries to differentiate the queries for each future time step. This design enables the effective use of historical prediction and planning by applying them to the appropriate parts of the end-to-end system based on the time steps, which improves both perception and motion planning. Specifically, historical queries for the current frame are combined with perception, while queries for future frames are integrated with motion planning. In this way, we bridge the gap between past and future by aggregating historical insights at every time step, enhancing the overall coherence and accuracy of the end-to-end autonomous driving pipeline. Extensive experiments on the nuScenes dataset in both open-loop and closed-loop settings demonstrate that BridgeAD achieves state-of-the-art performance.
https://arxiv.org/abs/2503.14182
To address the escalating threat of global wildfires, numerous computer vision techniques using remote sensing data have been applied in this area. However, the selection of deep learning methods for wildfire prediction remains uncertain due to the lack of quantitative, explainable comparative analysis, which is crucial for improving prevention measures and refining models. This study thoroughly compares the performance, efficiency, and explainability of four prevalent deep learning architectures: Autoencoder, ResNet, UNet, and Transformer-based Swin-UNet. Employing a real-world dataset that includes nearly a decade of remote sensing data from California, U.S., these models predict the spread of wildfires for the following day. Through detailed quantitative comparison, we discovered that Transformer-based Swin-UNet and UNet generally outperform Autoencoder and ResNet, particularly due to the advanced attention mechanisms in Swin-UNet and the efficient use of skip connections in both UNet and Swin-UNet, which contribute to superior predictive accuracy and model interpretability. We then applied XAI techniques to all four models; this not only enhances the clarity and trustworthiness of the models but also promotes focused improvements in wildfire prediction capabilities. The XAI analysis reveals that UNet and Swin-UNet are able to focus on critical features such as 'Previous Fire Mask', 'Drought', and 'Vegetation' more effectively than the other two models, while also maintaining balanced attention to the remaining features, leading to their superior performance. The insights from our thorough comparative analysis offer substantial implications for future model design and also provide guidance for model selection in different scenarios.
https://arxiv.org/abs/2503.14150
Automated Face Recognition Systems (FRSs), developed using deep learning models, are deployed worldwide for identity verification and facial attribute analysis. The performance of these models is determined by a complex interdependence among the model architecture, optimization/loss function and datasets. Although FRSs have surpassed human-level accuracy, they continue to exhibit disparities across certain demographics. Given the ubiquity of applications, it is extremely important to understand the impact of the three components -- model architecture, loss function and face image dataset -- on the accuracy-disparity trade-off in order to design better, unbiased platforms. In this work, we perform an in-depth analysis of three FRSs for the task of gender prediction, with various architectural modifications resulting in ten deep-learning models coupled with four loss functions, and benchmark them on seven face datasets across 266 evaluation configurations. Our results show that all three components have an individual as well as a combined impact on both accuracy and disparity. We identify that datasets have an inherent property that causes them to perform similarly across models, independent of the choice of loss functions. Moreover, the choice of dataset determines the model's perceived bias -- the same model reports bias in opposite directions for three gender-balanced datasets of ``in-the-wild'' face images of popular individuals. Studying the facial embeddings shows that the models are unable to generalize a uniform definition of what constitutes a ``female face'' as opposed to a ``male face'', due to dataset diversity. We provide recommendations to model developers on using our study as a blueprint for model development and subsequent deployment.
https://arxiv.org/abs/2503.14138
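The accuracy-disparity trade-off discussed above reduces to two numbers per configuration: overall accuracy and the accuracy gap between demographic groups. A minimal sketch with illustrative field names:

```python
# Compute overall accuracy and inter-group accuracy disparity.

def accuracy_disparity(records):
    """records: list of (group, correct: bool). Returns (acc, disparity)."""
    overall = sum(c for _, c in records) / len(records)
    groups = {}
    for g, c in records:
        groups.setdefault(g, []).append(c)
    per_group = {g: sum(v) / len(v) for g, v in groups.items()}
    disparity = max(per_group.values()) - min(per_group.values())
    return overall, disparity

records = [("A", True), ("A", True), ("B", True), ("B", False)]
acc, gap = accuracy_disparity(records)
```

A model with high `acc` but large `gap` performs well on average yet unevenly across groups, which is the disparity dimension the study benchmarks across its 266 configurations.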
Automatic anatomical landmark localization in medical imaging requires not just accurate predictions but reliable uncertainty quantification for effective clinical decision support. Current uncertainty quantification approaches often fall short, particularly when combined with normality assumptions, systematically underestimating total predictive uncertainty. This paper introduces conformal prediction as a framework for reliable uncertainty quantification in anatomical landmark localization, addressing a critical gap in automatic landmark localization. We present two novel approaches guaranteeing finite-sample validity for multi-output prediction: Multi-output Regression-as-Classification Conformal Prediction (M-R2CCP) and its variant Multi-output Regression to Classification Conformal Prediction set to Region (M-R2C2R). Unlike conventional methods that produce axis-aligned hyperrectangular or ellipsoidal regions, our approaches generate flexible, non-convex prediction regions that better capture the underlying uncertainty structure of landmark predictions. Through extensive empirical evaluation across multiple 2D and 3D datasets, we demonstrate that our methods consistently outperform existing multi-output conformal prediction approaches in both validity and efficiency. This work represents a significant advancement in reliable uncertainty estimation for anatomical landmark localization, providing clinicians with trustworthy confidence measures for their diagnoses. While developed for medical imaging, these methods show promise for broader applications in multi-output regression problems.
https://arxiv.org/abs/2503.14106
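The finite-sample guarantee conformal prediction provides can be seen in a minimal split-conformal sketch: calibrate a nonconformity score (here, the Euclidean error of a 2D landmark prediction), take its conservative (1 - alpha) quantile q, and report the ball of radius q around each new prediction. The paper's M-R2CCP and M-R2C2R regions are non-convex and far more flexible than this ball; the sketch only conveys the calibration mechanics.

```python
# Split conformal calibration for a 2D landmark predictor.
import math

def conformal_radius(cal_preds, cal_truths, alpha=0.1):
    """Radius covering a new landmark with probability >= 1 - alpha."""
    scores = sorted(
        math.dist(p, t) for p, t in zip(cal_preds, cal_truths)
    )
    n = len(scores)
    # conservative (1 - alpha) quantile index, clamped to the sample
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return scores[k]

# toy calibration set: predictions at the origin, errors 1..9
cal_preds = [(0, 0)] * 9
cal_truths = [(d, 0) for d in range(1, 10)]
q = conformal_radius(cal_preds, cal_truths, alpha=0.1)
```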
Accurate traffic flow estimation and prediction are critical for the efficient management of transportation systems, particularly under increasing urbanization. Traditional methods relying on static sensors often suffer from limited spatial coverage, while probe vehicles provide richer, albeit sparse and irregular, data. This work introduces ON-Traffic, a novel deep operator network and receding-horizon learning-based framework tailored for online estimation of the spatio-temporal traffic state, along with quantified uncertainty, using measurements from moving probe vehicles and downstream boundary inputs. Our framework is evaluated on both numerical and simulation datasets, showcasing its ability to handle irregular, sparse input data, adapt to time-shifted scenarios, and provide well-calibrated uncertainty estimates. The results demonstrate that the model captures complex traffic phenomena, including shockwaves and congestion propagation, while maintaining robustness to noise and sensor dropout. These advancements represent a significant step toward online, adaptive traffic management systems.
https://arxiv.org/abs/2503.14053
Lifting multi-view 2D instance segmentation to a radiance field has proven effective in enhancing 3D understanding. Existing methods rely on direct matching for end-to-end lifting, yielding inferior results, or employ a two-stage solution constrained by complex pre- or post-processing. In this work, we design a new end-to-end object-aware lifting approach, named Unified-Lift, that provides accurate 3D segmentation based on the 3D Gaussian representation. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object-level codebook to account for individual objects in the scene for an explicit object-level understanding and associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. While promising, achieving effective codebook learning is non-trivial and a naive solution leads to degraded performance. Therefore, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: the LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results manifest that our Unified-Lift clearly outperforms existing methods in terms of segmentation quality and time efficiency. The code is publicly available at \href{this https URL}{this https URL}.
https://arxiv.org/abs/2503.14029
Accurate body dimension and weight measurements are critical for optimizing poultry management, health assessment, and economic efficiency. This study introduces an innovative deep learning-based model leveraging multimodal data (2D RGB images from different views, depth images, and 3D point clouds) for the non-invasive estimation of duck body dimensions and weight. A dataset of 1,023 Linwu ducks, comprising over 5,000 samples with diverse postures and conditions, was collected to support model training. The proposed method innovatively employs PointNet++ to extract key feature points from point clouds, extracts and computes corresponding 3D geometric features, and fuses them with multi-view convolutional 2D features. A Transformer encoder is then utilized to capture long-range dependencies and refine feature interactions, thereby enhancing prediction robustness. The model achieved a mean absolute percentage error (MAPE) of 6.33% and an R2 of 0.953 across eight morphometric parameters, demonstrating strong predictive capability. Unlike conventional manual measurements, the proposed model enables high-precision estimation while eliminating the necessity for physical handling, thereby reducing animal stress and broadening its application scope. This study marks the first application of deep learning techniques to poultry body dimension and weight estimation, providing a valuable reference for the intelligent and precise management of the livestock industry with far-reaching practical significance.
https://arxiv.org/abs/2503.14001
The BI-RADS score is a probabilistic reporting tool used by radiologists to express the level of uncertainty in predicting breast cancer based on morphological features in mammography images. There is significant variability in describing masses, which sometimes leads to BI-RADS misclassification. A BI-RADS prediction system is therefore needed to support radiologists' final decisions. In this study, the uncertainty information extracted by a Bayesian deep learning model is utilized to predict the BI-RADS score. The investigation results based on the pathology information demonstrate that the F1-scores of the radiologist's predictions are 42.86%, 48.33%, and 48.28% on the BI-RADS 2, 3, and 5 dataset samples, respectively, while the model achieves F1-scores of 73.33%, 59.60%, and 59.26%. Moreover, the model can distinguish malignant from benign samples in the BI-RADS 0 category of the dataset with an accuracy of 75.86% and correctly identifies all malignant samples as BI-RADS 5. Grad-CAM visualization shows that the model attends to the morphological features of the lesions. This study thus shows that an uncertainty-aware Bayesian deep learning model can report its uncertainty about the malignancy of a lesion based on morphological features, much like a radiologist.
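As a hedged illustration of how a Bayesian model can report uncertainty, the sketch below averages per-sample malignancy probabilities from repeated stochastic forward passes (e.g. MC dropout) and computes the predictive entropy of the mean; the probabilities are synthetic stand-ins, not the study's model outputs:

```python
# Illustrative uncertainty summary from Monte Carlo predictions.
import math

def predictive_uncertainty(mc_probs):
    """Given malignancy probabilities from repeated stochastic forward
    passes, return (mean probability, binary entropy of the mean in
    bits). Low entropy = confident; entropy near 1 bit = uncertain."""
    p = sum(mc_probs) / len(mc_probs)
    entropy = 0.0
    for q in (p, 1.0 - p):
        if q > 0:
            entropy -= q * math.log2(q)
    return p, entropy

confident = predictive_uncertainty([0.95, 0.97, 0.96, 0.94])
uncertain = predictive_uncertainty([0.2, 0.8, 0.45, 0.55])
print(confident)  # high mean, low entropy: confident malignant call
print(uncertain)  # mean near 0.5, entropy near 1 bit: flag for review
```

A prediction with high entropy is the kind the abstract suggests should be surfaced to the radiologist rather than decided automatically.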
https://arxiv.org/abs/2503.13999
Medical ultrasound imaging is ubiquitous, but manual analysis struggles to keep pace. Automated segmentation can help but requires large labeled datasets, which are scarce. Semi-supervised learning, leveraging both unlabeled and limited labeled data, is a promising approach. State-of-the-art methods use consistency regularization or pseudo-labeling but grow increasingly complex. Without sufficient labels, these models often latch onto artifacts or allow anatomically implausible segmentations. In this paper, we present a simple yet effective pseudo-labeling method with an adversarially learned shape prior to regularize segmentations. Specifically, we devise an encoder-twin-decoder network in which the shape prior acts as an implicit shape model, penalizing anatomically implausible predictions rather than mere deviations from the ground truth. Without bells and whistles, our simple approach achieves state-of-the-art performance on two benchmarks under different partition protocols. We provide a strong baseline for future semi-supervised medical image segmentation. Code is available at this https URL.
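The pseudo-labeling idea can be sketched as below, omitting the adversarial shape prior: unlabeled pixels whose predicted foreground probability is confidently high or low receive pseudo labels, while ambiguous pixels are ignored during training. The threshold value is an assumption for illustration, not the paper's setting:

```python
# Simplified confidence-thresholded pseudo-labeling for binary
# segmentation (hypothetical threshold; shape prior not shown).

def pseudo_label(probs, threshold=0.9):
    """Map per-pixel foreground probabilities to pseudo labels:
    1 (foreground), 0 (background), or None (ignored, too uncertain
    to supervise)."""
    labels = []
    for p in probs:
        if p >= threshold:
            labels.append(1)
        elif p <= 1.0 - threshold:
            labels.append(0)
        else:
            labels.append(None)
    return labels

print(pseudo_label([0.97, 0.5, 0.03, 0.85]))  # → [1, None, 0, None]
```

In the full method, the adversarially learned shape prior additionally penalizes pseudo labels that form anatomically implausible shapes, which plain thresholding cannot catch.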
https://arxiv.org/abs/2503.13987