Automated Face Recognition Systems (FRSs), developed using deep learning models, are deployed worldwide for identity verification and facial attribute analysis. The performance of these models is determined by a complex interdependence among the model architecture, optimization/loss function, and datasets. Although FRSs have surpassed human-level accuracy, they continue to exhibit disparities against certain demographics. Due to the ubiquity of these applications, it is extremely important to understand the impact of the three components (model architecture, loss function, and face image dataset) on the accuracy-disparity trade-off to design better, unbiased platforms. In this work, we perform an in-depth analysis of three FRSs for the task of gender prediction, with various architectural modifications resulting in ten deep-learning models coupled with four loss functions, and benchmark them on seven face datasets across 266 evaluation configurations. Our results show that all three components have an individual as well as a combined impact on both accuracy and disparity. We identify that datasets have an inherent property that causes them to perform similarly across models, independent of the choice of loss function. Moreover, the choice of dataset determines the model's perceived bias -- the same model reports bias in opposite directions for three gender-balanced datasets of "in-the-wild" face images of popular individuals. Studying the facial embeddings shows that the models are unable to generalize a uniform definition of what constitutes a "female face" as opposed to a "male face", owing to dataset diversity. We provide recommendations to model developers on using our study as a blueprint for model development and subsequent deployment.
https://arxiv.org/abs/2503.14138
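As a quick illustration of the accuracy-disparity trade-off studied above, here is a minimal sketch for a binary gender-prediction task; the disparity measure (per-group accuracy gap) is one common choice and not necessarily the exact metric used in the paper.

```python
import numpy as np

def accuracy_disparity(y_true, y_pred, group):
    """Overall accuracy, per-group accuracy, and the accuracy gap
    between demographic groups for a gender classifier."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    per_group = {}
    for g in np.unique(group):
        mask = group == g
        per_group[str(g)] = float((y_true[mask] == y_pred[mask]).mean())
    overall = float((y_true == y_pred).mean())
    disparity = max(per_group.values()) - min(per_group.values())
    return overall, per_group, disparity

# Toy example: a model that is more accurate on one group than the other.
print(accuracy_disparity(
    y_true=[0, 0, 1, 1, 1, 0],
    y_pred=[0, 1, 1, 1, 0, 0],
    group=["f", "f", "f", "m", "m", "m"],
))
```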
We present components of an AI-assisted academic writing system including citation recommendation and introduction writing. The system recommends citations by considering the user's current document context to provide relevant suggestions. It generates introductions in a structured fashion, situating the contributions of the research relative to prior work. We demonstrate the effectiveness of the components through quantitative evaluations. Finally, the paper presents qualitative research exploring how researchers incorporate citations into their writing workflows. Our findings indicate that there is demand for precise AI-assisted writing systems and simple, effective methods for meeting those needs.
https://arxiv.org/abs/2503.13771
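A minimal sketch of context-aware citation recommendation by embedding similarity; the encoder and ranking scheme are assumptions for illustration, since the paper does not spell out this implementation.

```python
import numpy as np

def recommend_citations(context_vec, candidate_vecs, candidate_ids, k=5):
    """Rank candidate papers by cosine similarity to the current
    document-context embedding and return the top-k ids with scores."""
    ctx = context_vec / np.linalg.norm(context_vec)
    cands = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = cands @ ctx
    top = np.argsort(-scores)[:k]
    return [(candidate_ids[i], float(scores[i])) for i in top]

# Toy usage with random vectors standing in for a real text encoder.
rng = np.random.default_rng(0)
print(recommend_citations(rng.normal(size=128),
                          rng.normal(size=(100, 128)),
                          [f"paper-{i}" for i in range(100)]))
```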
Efficient management of end-of-life (EoL) products is critical for advancing circularity in supply chains, particularly within the construction industry, where EoL strategies are hindered by heterogeneous lifecycle data and data silos. Current tools like Environmental Product Declarations (EPDs) and Digital Product Passports (DPPs) are limited by their dependency on seamless data integration and interoperability, which remain significant challenges. To address these, we present the Circular Construction Product Ontology (CCPO), an applied framework designed to overcome semantic and data heterogeneity challenges in EoL decision-making for construction products. CCPO standardises vocabulary and facilitates data integration across supply chain stakeholders, enabling lifecycle assessments (LCA) and robust decision-making. By aggregating disparate data into a unified product provenance, CCPO enables automated EoL recommendations through customisable SWRL rules aligned with European standards and stakeholder-specific circularity SLAs, demonstrating its scalability and integration capabilities. The adopted circular product scenario illustrates CCPO's application, while competency-question evaluations show its superior performance in generating accurate EoL suggestions, highlighting its potential to greatly improve decision-making in circular supply chains and its applicability in real-world construction environments.
https://arxiv.org/abs/2503.13708
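The flavour of the rule-based EoL recommendation can be pictured in plain Python; the attribute names and thresholds below are hypothetical, and the actual rules are expressed in SWRL over the ontology and aligned with European standards and stakeholder SLAs.

```python
def recommend_eol(product):
    """Illustrative end-of-life decision rule in the spirit of CCPO's
    customisable SWRL rules (attributes and cut-offs are made up)."""
    if product.get("hazardous"):
        return "specialist disposal"
    if product.get("condition_score", 0.0) >= 0.8 and product.get("certified_for_reuse"):
        return "reuse"
    if product.get("recyclable_fraction", 0.0) >= 0.5:
        return "recycle"
    return "energy recovery"

print(recommend_eol({"condition_score": 0.9, "certified_for_reuse": True}))
```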
Tabular data analysis is crucial in many scenarios, yet efficiently identifying the most relevant data analysis queries and results for a new table remains a significant challenge. The complexity of tabular data, diverse analytical operations, and the demand for high-quality analysis make the process tedious. To address these challenges, we aim to recommend query-code-result triplets tailored for new tables in tabular data analysis workflows. In this paper, we present TablePilot, a pioneering tabular data analysis framework leveraging large language models to autonomously generate comprehensive and superior analytical results without relying on user profiles or prior interactions. The framework incorporates key designs in analysis preparation and analysis optimization to enhance accuracy. Additionally, we propose Rec-Align, a novel method to further improve recommendation quality and better align with human preferences. Experiments on DART, a dataset specifically designed for comprehensive tabular data analysis recommendation, demonstrate the effectiveness of our framework. Based on GPT-4o, the tuned TablePilot achieves 77.0% top-5 recommendation recall. Human evaluations further highlight its effectiveness in optimizing tabular data analysis workflows.
https://arxiv.org/abs/2503.13262
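The reported 77.0% top-5 recommendation recall can be understood with a standard recall@k computation over recommended analysis triplets; this is a generic sketch, not TablePilot's evaluation code, and the triplet ids are placeholders.

```python
def recall_at_k(recommended, relevant, k=5):
    """Average, over tables, of the fraction of gold analysis triplets
    that appear among the top-k recommended triplets."""
    scores = []
    for recs, gold in zip(recommended, relevant):
        gold = set(gold)
        if not gold:
            continue
        hits = sum(1 for r in recs[:k] if r in gold)
        scores.append(hits / len(gold))
    return sum(scores) / len(scores)

# Two tables, each with model-ranked recommendations and gold analyses.
print(recall_at_k(
    recommended=[["q1", "q3", "q7", "q2", "q9"], ["q4", "q8", "q1", "q5", "q6"]],
    relevant=[["q1", "q2"], ["q5", "q9"]],
))
```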
Despite significant advances in deep learning for image and video segmentation, existing models continue to face challenges in cross-domain adaptability and generalization. Image and video segmentation are fundamental tasks in computer vision with wide-ranging applications in healthcare, agriculture, industrial inspection, and autonomous driving. With the advent of large-scale foundation models, SAM2, an improved version of SAM (Segment Anything Model), has been optimized for segmentation tasks, demonstrating enhanced performance in complex scenarios. However, SAM2's adaptability and limitations in specific domains require further investigation. This paper systematically analyzes the application of SAM2 in image and video segmentation and evaluates its performance in various fields. We begin by introducing the foundational concepts of image segmentation, categorizing foundation models, and exploring the technical characteristics of SAM and SAM2. Subsequently, we delve into SAM2's applications in static image and video segmentation, emphasizing its performance in specialized areas such as medical imaging and the challenges of cross-domain adaptability. As part of our research, we reviewed over 200 related papers to provide a comprehensive analysis of the topic. Finally, the paper highlights the strengths and weaknesses of SAM2 in segmentation tasks, identifies the technical challenges it faces, and proposes future development directions. This review provides valuable insights and practical recommendations for optimizing and applying SAM2 in real-world scenarios.
https://arxiv.org/abs/2503.12781
In multi-turn dialogues, large language models (LLM) face a critical challenge of ensuring coherence while adapting to user-specific information. This study introduces the persona knowledge gap, the discrepancy between a model's internal understanding and the knowledge required for coherent, personalized conversations. While prior research has recognized these gaps, computational methods for their identification and resolution remain underexplored. We propose Conversation Preference Elicitation and Recommendation (CPER), a novel framework that dynamically detects and resolves persona knowledge gaps using intrinsic uncertainty quantification and feedback-driven refinement. CPER consists of three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses based on accumulated user context. We evaluate CPER on two real-world datasets: CCPE-M for preferential movie recommendations and ESConv for mental health support. Using A/B testing, human evaluators preferred CPER's responses 42% more often than baseline models in CCPE-M and 27% more often in ESConv. A qualitative human evaluation confirms that CPER's responses are preferred for maintaining contextual relevance and coherence, particularly in longer (12+ turn) conversations.
https://arxiv.org/abs/2503.12556
The Internet of Things (IoT) has enabled diverse devices to communicate over the Internet, yet the fragmentation of IoT systems limits seamless data sharing and coordinated management. We have recently introduced SensorsConnect, a unified framework to enable seamless content and sensor data sharing in collaborative IoT systems, inspired by how the World Wide Web (WWW) enabled a shared and accessible space for information among humans. This paper presents the IoT Agentic Search Engine (IoT-ASE), a real-time search engine tailored for IoT environments. IoT-ASE leverages Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) techniques to address the challenge of searching vast, real-time IoT data, enabling it to handle complex queries and deliver accurate, contextually relevant results. We implemented a use-case scenario in Toronto to demonstrate how IoT-ASE can improve service quality recommendations by leveraging real-time IoT data. Our evaluation shows that IoT-ASE achieves 92% accuracy in retrieving intent-based services and produces responses that are concise, relevant, and context-aware, outperforming generalized responses from systems like Gemini. These findings highlight the potential of IoT-ASE to make real-time IoT data accessible and support effective, real-time decision-making.
https://arxiv.org/abs/2503.12255
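A minimal retrieval-augmented generation step over real-time IoT records, sketched under assumptions: the embedding retrieval and prompt format below are illustrative, and IoT-ASE's actual pipeline is considerably more elaborate.

```python
import numpy as np

def retrieve(query_vec, record_vecs, records, k=3):
    """Return the k sensor records whose embeddings are closest (cosine)
    to the user query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    r = record_vecs / np.linalg.norm(record_vecs, axis=1, keepdims=True)
    top = np.argsort(-(r @ q))[:k]
    return [records[i] for i in top]

def build_prompt(user_query, retrieved_records):
    """Stuff the retrieved real-time readings into the LLM prompt (RAG)."""
    context = "\n".join(f"- {rec}" for rec in retrieved_records)
    return (
        "Answer the user using only the sensor readings below.\n"
        f"Readings:\n{context}\n\nQuestion: {user_query}"
    )
```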
This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user "steering" queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset's quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.
https://arxiv.org/abs/2503.11924
As frontier models become more capable, the community has attempted to evaluate their ability to enable cyberattacks. Performing a comprehensive evaluation and prioritizing defenses are crucial tasks in preparing for AGI safely. However, current cyber evaluation efforts are ad-hoc, with no systematic reasoning about the various phases of attacks, and do not provide a steer on how to use targeted defenses. In this work, we propose a novel approach to AI cyber capability evaluation that (1) examines the end-to-end attack chain, (2) helps to identify gaps in the evaluation of AI threats, and (3) helps defenders prioritize targeted mitigations and conduct AI-enabled adversary emulation to support red teaming. To achieve these goals, we propose adapting existing cyberattack chain frameworks to AI systems. We analyze over 12,000 instances of real-world attempts to use AI in cyberattacks catalogued by Google's Threat Intelligence Group. Using this analysis, we curate a representative collection of seven cyberattack chain archetypes and conduct a bottleneck analysis to identify areas of potential AI-driven cost disruption. Our evaluation benchmark consists of 50 new challenges spanning different phases of cyberattacks. Based on this, we devise targeted cybersecurity model evaluations, report on the potential for AI to amplify offensive cyber capabilities across specific attack phases, and conclude with recommendations on prioritizing defenses. In all, we consider this to be the most comprehensive AI cyber risk evaluation framework published so far.
https://arxiv.org/abs/2503.11917
Prompting LLMs offers an efficient way to guide output generation without explicit model training. In the e-commerce domain, prompting-based applications are widely used for tasks such as query understanding, recommender systems, and customer support. However, adapting LLMs to different tasks often requires extensive prompt engineering by domain experts, along with frequent updates to align with evolving business needs. Additionally, crafting fully unbiased natural language prompts remains a challenge for humans. To address these challenges, we propose a novel framework, Examples as the Prompt (EaP), which leverages labeled data to enhance prompts. Specifically, EaP automatically selects the most representative examples to maximize the few-shot capability of LLMs. It is efficient due to its unsupervised example selection and adaptive to potential data distribution shifts. We validate EaP on four real-world production use cases, demonstrating that it achieves comparable or even superior performance compared to hand-crafted prompts designed by domain experts. Additionally, we introduce EaP_lite, which entirely replaces the natural language components of prompts with labeled examples. EaP_lite improves LLM inference speed by up to 70% without compromising performance. The latest online A/B test shows that using EaP and EaP_lite for data labeling can bring a significant composite revenue gain of 0.06%.
https://arxiv.org/abs/2503.13518
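One common way to do unsupervised, representative example selection for few-shot prompting, sketched below, is to cluster example embeddings and take the example nearest each centroid; this is an assumption about the mechanism, not EaP's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_examples(embeddings, examples, n_shots=8, seed=0):
    """Pick one labeled example per cluster centroid to use as few-shot
    demonstrations in the prompt (no supervision needed for selection)."""
    km = KMeans(n_clusters=n_shots, n_init=10, random_state=seed).fit(embeddings)
    chosen = []
    for c in range(n_shots):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        chosen.append(examples[idx[np.argmin(dists)]])
    return chosen
```

Re-running the selection as the labeled pool grows is one simple way to stay adaptive to distribution shifts.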
An interesting class of commonsense reasoning problems arises when people are faced with natural disasters. To investigate this topic, we present RESPONSE, a human-curated dataset containing 1789 annotated instances featuring 6037 sets of questions designed to assess LLMs' commonsense reasoning in disaster situations across different time frames. The dataset includes problem descriptions, missing resources, time-sensitive solutions, and their justifications, with a subset validated by environmental engineers. Through both automatic metrics and human evaluation, we compare LLM-generated recommendations against human responses. Our findings show that even state-of-the-art models like GPT-4 achieve only 37% human-evaluated correctness for immediate response actions, highlighting significant room for improvement in LLMs' ability for commonsense reasoning in crises.
https://arxiv.org/abs/2503.11348
Precision therapeutics require multimodal adaptive models that generate personalized treatment recommendations. We introduce TxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies. TxAgent evaluates how drugs interact at molecular, pharmacokinetic, and clinical levels, identifies contraindications based on patient comorbidities and concurrent medications, and tailors treatment strategies to individual patient characteristics. It retrieves and synthesizes evidence from multiple biomedical sources, assesses interactions between drugs and patient conditions, and refines treatment recommendations through iterative reasoning. It selects tools based on task objectives and executes structured function calls to solve therapeutic tasks that require clinical reasoning and cross-source validation. The ToolUniverse consolidates 211 tools from trusted sources, including all US FDA-approved drugs since 1939 and validated clinical insights from Open Targets. TxAgent outperforms leading LLMs, tool-use models, and reasoning agents across five new benchmarks: DrugPC, BrandPC, GenericPC, TreatmentPC, and DescriptionPC, covering 3,168 drug reasoning tasks and 456 personalized treatment scenarios. It achieves 92.1% accuracy in open-ended drug reasoning tasks, surpassing GPT-4o and outperforming DeepSeek-R1 (671B) in structured multi-step reasoning. TxAgent generalizes across drug name variants and descriptions. By integrating multi-step inference, real-time knowledge grounding, and tool-assisted decision-making, TxAgent ensures that treatment recommendations align with established clinical guidelines and real-world evidence, reducing the risk of adverse events and improving therapeutic decision-making.
https://arxiv.org/abs/2503.10970
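The multi-step, tool-assisted reasoning loop can be pictured with a tiny tool registry and a structured call dispatcher; the tool names, stub outputs, and call format below are illustrative placeholders, not the ToolUniverse API.

```python
from typing import Callable, Dict

# Hypothetical registry; the real ToolUniverse consolidates 211 tools.
TOOLS: Dict[str, Callable[..., str]] = {
    "drug_interactions": lambda drug_a, drug_b: f"[stub] interaction lookup for {drug_a} + {drug_b}",
    "contraindications": lambda drug, condition: f"[stub] contraindication lookup for {drug} given {condition}",
}

def run_plan(plan):
    """Execute a sequence of structured function calls and collect evidence.

    `plan` is a list of (tool_name, kwargs) steps; a real agent would let
    the LLM choose each next step from the accumulated results."""
    evidence = []
    for tool_name, kwargs in plan:
        result = TOOLS[tool_name](**kwargs)
        evidence.append({"tool": tool_name, "args": kwargs, "result": result})
    return evidence

print(run_plan([
    ("drug_interactions", {"drug_a": "drug_A", "drug_b": "drug_B"}),
    ("contraindications", {"drug": "drug_A", "condition": "condition_X"}),
]))
```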
AI alignment, the challenge of ensuring AI systems act in accordance with human values, has emerged as a critical problem in the development of systems such as foundation models and recommender systems. Still, the current dominant approach, reinforcement learning with human feedback (RLHF), faces known theoretical limitations in aggregating diverse human preferences. Social choice theory provides a framework to aggregate preferences, but was not developed for the multidimensional applications typical of AI. Leveraging insights from a recently published urn process, this work introduces a preference aggregation strategy that adapts to the user's context and that inherits the good properties of the maximal lottery, a Condorcet-consistent solution concept.
https://arxiv.org/abs/2503.10215
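For reference, the maximal lottery mentioned above is the optimal mixed strategy of the symmetric zero-sum game induced by pairwise majority margins; a sketch of computing it by linear programming (independent of the paper's urn process) looks like this.

```python
import numpy as np
from scipy.optimize import linprog

def maximal_lottery(margins):
    """Maximal lottery for a skew-symmetric majority-margin matrix.

    margins[i, j] is the signed margin of voters preferring alternative i
    to alternative j; the lottery is the maximin mixed strategy of the
    induced symmetric zero-sum game."""
    M = np.asarray(margins, dtype=float)
    n = M.shape[0]
    # Variables: p_1..p_n and the game value v. Maximize v subject to
    # (p^T M)_j >= v for every column j, sum(p) = 1, p >= 0.
    c = np.zeros(n + 1)
    c[-1] = -1.0                                      # minimize -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])         # -M^T p + v <= 0
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)  # probabilities sum to 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:n]

# Condorcet cycle a > b > c > a: the maximal lottery is uniform.
print(maximal_lottery([[0, 1, -1], [-1, 0, 1], [1, -1, 0]]))
```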
This paper investigates the complex interplay between AI developers, regulators, users, and the media in fostering trustworthy AI systems. Using evolutionary game theory and large language models (LLMs), we model the strategic interactions among these actors under different regulatory regimes. The research explores two key mechanisms for achieving responsible governance, safe AI development, and adoption of safe AI: incentivising effective regulation through media reporting, and conditioning user trust on the commentariat's recommendations. The findings highlight the crucial role of the media in providing information to users, potentially acting as a form of "soft" regulation by investigating developers or regulators, as a substitute for institutional AI regulation (which is still absent in many regions). Both game-theoretic analysis and LLM-based simulations reveal the conditions under which effective regulation and trustworthy AI development emerge, emphasising the importance of considering the influence of different regulatory regimes from an evolutionary game-theoretic perspective. The study concludes that effective governance requires managing the incentives and costs of high-quality commentary.
https://arxiv.org/abs/2503.09858
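The evolutionary game-theoretic modelling can be illustrated with simple two-strategy replicator dynamics for the developer population; the payoff numbers below are placeholders, not the paper's parameterisation.

```python
import numpy as np

def replicator(payoff, x0, steps=1000, dt=0.01):
    """Two-strategy replicator dynamics: x is the share of developers
    playing 'safe'; payoff[i][j] is the payoff of strategy i against j."""
    A = np.asarray(payoff, dtype=float)
    x = float(x0)
    for _ in range(steps):
        pop = np.array([x, 1 - x])
        f_safe, f_unsafe = A[0] @ pop, A[1] @ pop
        # dx/dt = x (1 - x) (f_safe - f_unsafe), integrated with Euler steps.
        x += dt * x * (1 - x) * (f_safe - f_unsafe)
    return x

# Placeholder payoffs in which media scrutiny makes unsafe development costly.
print(replicator([[3.0, 1.0], [2.0, 0.5]], x0=0.2))
```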
Methods to quantify uncertainty in predictions from arbitrary models are in demand in high-stakes domains like medicine and finance. Conformal prediction has emerged as a popular method for producing a set of predictions with specified average coverage, in place of a single prediction and confidence value. However, the value of conformal prediction sets to assist human decisions remains elusive due to the murky relationship between coverage guarantees and decision makers' goals and strategies. How should we think about conformal prediction sets as a form of decision support? Under what conditions do we expect the support they provide to be superior versus inferior to that of alternative presentations of predictive uncertainty? We outline a decision theoretic framework for evaluating predictive uncertainty as informative signals, then contrast what can be said within this framework about idealized use of calibrated probabilities versus conformal prediction sets. Informed by prior empirical results and theories of human decisions under uncertainty, we formalize a set of possible strategies by which a decision maker might use a prediction set. We identify ways in which conformal prediction sets and posthoc predictive uncertainty quantification more broadly are in tension with common goals and needs in human-AI decision making. We give recommendations for future research in predictive uncertainty quantification to support human decision makers.
https://arxiv.org/abs/2503.11709
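For readers unfamiliar with the prediction sets being discussed, a standard split conformal construction for classification is sketched below; it uses the basic "one minus softmax probability of the true class" score, which is only one of several common choices.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets with marginal coverage >= 1 - alpha.

    cal_probs:  (n, K) softmax outputs on a held-out calibration set
    cal_labels: (n,)   true labels for the calibration set
    test_probs: (m, K) softmax outputs on test points
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # A label enters the set when its score does not exceed the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]
```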
Recommender systems (RecSys) are widely used across various modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits comprehensive assessment of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new dataset benchmark designed to assess LLMs' ability to handle intricate user recommendation needs in the era of LLMs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered the following findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants; 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at this https URL.
https://arxiv.org/abs/2503.09382
As artificial intelligence and digital medicine increasingly permeate healthcare systems, robust governance frameworks are essential to ensure ethical, secure, and effective implementation. In this context, medical image retrieval becomes a critical component of clinical data management, playing a vital role in decision-making and safeguarding patient information. Existing methods usually learn hash functions using bottleneck features, which fail to produce representative hash codes from blended embeddings. Although contrastive hashing has shown superior performance, current approaches often treat image retrieval as a classification task, using category labels to create positive/negative pairs. Moreover, many methods fail to address the out-of-distribution (OOD) issue when models encounter external OOD queries or adversarial attacks. In this work, we propose a novel method to consolidate knowledge of hierarchical features and optimisation functions. We formulate the knowledge consolidation by introducing Depth-aware Representation Fusion (DaRF) and Structure-aware Contrastive Hashing (SCH). DaRF adaptively integrates shallow and deep representations into blended features, and SCH incorporates image fingerprints to enhance the adaptability of positive/negative pairings. These blended features further facilitate OOD detection and content-based recommendation, contributing to a secure AI-driven healthcare environment. Moreover, we present a content-guided ranking to improve the robustness and reproducibility of retrieval results. Our comprehensive assessments demonstrate that the proposed method could effectively recognise OOD samples and significantly outperform existing approaches in medical image retrieval (p<0.05). In particular, our method achieves a 5.6-38.9% improvement in mean Average Precision on the anatomical radiology dataset.
https://arxiv.org/abs/2503.09370
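The hash-based retrieval setting can be illustrated with binary codes and Hamming-distance ranking; this is generic deep-hashing retrieval, not the proposed DaRF/SCH method, and the random embeddings stand in for a trained encoder.

```python
import numpy as np

def to_binary_codes(embeddings):
    """Binarize real-valued (blended) embeddings into +/-1 hash codes."""
    return np.where(embeddings >= 0, 1, -1).astype(np.int8)

def hamming_retrieve(query_code, db_codes, k=10):
    """Rank database images by Hamming distance to the query code."""
    dists = (query_code != db_codes).sum(axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
db = to_binary_codes(rng.normal(size=(1000, 64)))
query = to_binary_codes(rng.normal(size=64))
print(hamming_retrieve(query, db, k=5))
```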
Adaptive questionnaires dynamically select the next question for a survey participant based on their previous answers. Due to digitalisation, they have become a viable alternative to traditional surveys in application areas such as political science. One limitation, however, is their dependency on data to train the model for question selection. Often, such training data (i.e., user interactions) are unavailable a priori. To address this problem, we (i) test whether Large Language Models (LLM) can accurately generate such interaction data and (ii) explore if these synthetic data can be used to pre-train the statistical model of an adaptive political survey. To evaluate this approach, we utilise existing data from the Swiss Voting Advice Application (VAA) Smartvote in two ways: First, we compare the distribution of LLM-generated synthetic data to the real distribution to assess its similarity. Second, we compare the performance of an adaptive questionnaire that is randomly initialised with one pre-trained on synthetic data to assess their suitability for training. We benchmark these results against an "oracle" questionnaire with perfect prior knowledge. We find that an off-the-shelf LLM (GPT-4) accurately generates answers to the Smartvote questionnaire from the perspective of different Swiss parties. Furthermore, we demonstrate that initialising the statistical model with synthetic data can (i) significantly reduce the error in predicting user responses and (ii) increase the candidate recommendation accuracy of the VAA. Our work emphasises the considerable potential of LLMs to create training data to improve the data collection process in adaptive questionnaires in LLM-affine areas such as political surveys.
https://arxiv.org/abs/2503.09311
In recent years, the rapid advancement of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), has revolutionized the paradigm of scientific discovery, establishing AI-for-Science (AI4Science) as a dynamic and evolving field. However, there is still a lack of an effective framework for the overall assessment of AI4Science, particularly from a holistic perspective on data quality and model capability. Therefore, in this study, we propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science from both scientific data and LLM perspectives. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions: Quality, FAIRness, Explainability, and Compliance, which are subdivided into 15 sub-dimensions. Drawing on data resource papers published between 2018 and 2023 in peer-reviewed journals, we present recommendation lists of AI-ready datasets for both Earth and Life Sciences, making a novel and original contribution to the field. Concurrently, to assess the capabilities of LLMs across multiple scientific disciplines, we establish 16 assessment dimensions based on five core indicators (Knowledge, Understanding, Reasoning, Multimodality, and Values) spanning Mathematics, Physics, Chemistry, Life Sciences, and Earth and Space Sciences. Using the developed benchmark datasets, we have conducted a comprehensive evaluation of over 20 representative open-source and closed-source LLMs. All the results are publicly available and can be accessed online at this http URL.
https://arxiv.org/abs/2503.13503
LLM-based automated program repair (APR) methods have attracted significant attention for their state-of-the-art performance. However, they were primarily evaluated on a few well-known datasets like Defects4J, raising questions about their effectiveness on new datasets. In this study, we evaluate 11 top-performing LLMs on DEFECTS4J-TRANS, a new dataset derived from transforming Defects4J while maintaining the original semantics. Results from experiments on both Defects4J and DEFECTS4J-TRANS show that all studied LLMs have limited generalizability in APR tasks, with the average number of correct and plausible patches decreasing by 49.48% and 42.90%, respectively, on DEFECTS4J-TRANS. Further investigation into incorporating additional repair-relevant information in repair prompts reveals that, although this information significantly enhances the LLMs' capabilities (increasing the number of correct and plausible patches by up to 136.67% and 121.82%, respectively), performance still falls short of their original results. This indicates that prompt engineering alone is insufficient to substantially enhance LLMs' repair capabilities. Based on our study, we also offer several recommendations for future research.
https://arxiv.org/abs/2503.09217
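The "additional repair-relevant information" finding can be pictured as a richer repair prompt; the fields included below are a guess at what such information might look like, not the study's exact prompt template.

```python
def build_repair_prompt(buggy_method, failing_test, error_message, extra_context=None):
    """Assemble an APR prompt; extra_context carries repair-relevant
    information (e.g., failing assertions, related methods), which the
    study found improves, but does not fully restore, LLM performance."""
    parts = [
        "Fix the bug in the following Java method.",
        f"Buggy method:\n{buggy_method}",
        f"Failing test:\n{failing_test}",
        f"Error message:\n{error_message}",
    ]
    if extra_context:
        parts.append(f"Additional context:\n{extra_context}")
    parts.append("Return only the fixed method.")
    return "\n\n".join(parts)
```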