Designing efficient optimizers for large language models (LLMs) with low memory requirements and fast convergence is an important and challenging problem. This paper takes a step toward the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations for practical efficient LLM optimizers: carefully selecting structural assumptions to balance generality and efficiency, and enhancing the memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate their effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam, with little memory overhead. Notably, Alice converges more than 2x faster than Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.
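To make the row-and-column scaling idea concrete, below is a minimal sketch of what a RACS-style update for a 2D weight matrix could look like, assuming the scaling factors come from running row-wise and column-wise second moments of the gradient; the function name, hyperparameters, and exact scaling rule are illustrative and not taken from the paper.

```python
import torch

def racs_like_step(weight, grad, state, lr=1e-3, beta=0.9, eps=1e-8):
    """Illustrative row/column-scaled SGD update for a 2D weight matrix.

    Only one running statistic per row and per column is kept (O(m + n) memory,
    versus O(m * n) for Adam's second moment), and the raw gradient is rescaled
    by the two factors before the SGD step.
    """
    m, n = grad.shape
    row_ms = state.setdefault("row_ms", torch.zeros(m))
    col_ms = state.setdefault("col_ms", torch.zeros(n))

    # Update running second moments of the gradient along rows and columns.
    row_ms.mul_(beta).add_((grad ** 2).mean(dim=1), alpha=1 - beta)
    col_ms.mul_(beta).add_((grad ** 2).mean(dim=0), alpha=1 - beta)

    # Rescale the gradient by per-row and per-column inverse RMS factors.
    scaled = grad / (row_ms.sqrt().unsqueeze(1) + eps)
    scaled = scaled / (col_ms.sqrt().unsqueeze(0) + eps)

    weight.add_(scaled, alpha=-lr)
```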
https://arxiv.org/abs/2502.07752
To help users make privacy-related decisions, personalized privacy assistants based on AI technology have been developed in recent years. These AI-driven Personalized Privacy Assistants (AI-driven PPAs) can bring significant benefits to users, who may otherwise struggle to make decisions regarding their personal data in environments saturated with privacy-related decision requests. However, no study has systematically examined the features of these AI-driven PPAs, their underlying technologies, or the accuracy of their decisions. To fill this gap, we present a Systematization of Knowledge (SoK) to map the existing solutions found in the scientific literature. We screened 1697 unique research papers from the last decade (2013-2023), constructing a classification from the 39 included papers. As a result, this SoK reviews several aspects of existing research on AI-driven PPAs in terms of types of publications, contributions, methodological quality, and other quantitative insights. Furthermore, we provide a comprehensive classification for AI-driven PPAs, delving into their architectural choices, system contexts, types of AI used, data sources, types of decisions, and control over decisions, among other facets. Based on our SoK, we further underline research gaps and challenges and formulate recommendations for the design and development of AI-driven PPAs, as well as avenues for future research.
https://arxiv.org/abs/2502.07693
Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains such as autonomous vehicles, smart surveillance, and healthcare, their deployment on resource-constrained edge devices remains challenging due to processing power, memory, and energy limitations. This survey explores recent advancements in optimizing VLMs for edge environments, focusing on model compression techniques, including pruning, quantization, and knowledge distillation, as well as specialized hardware solutions that enhance efficiency. We provide a detailed discussion of efficient training and fine-tuning methods, edge deployment challenges, and privacy considerations. Additionally, we discuss the diverse applications of lightweight VLMs across healthcare, environmental monitoring, and autonomous systems, illustrating their growing impact. By highlighting key design strategies and current challenges, and by offering recommendations for future directions, this survey aims to inspire further research into the practical deployment of VLMs, ultimately making advanced AI accessible in resource-limited settings.
https://arxiv.org/abs/2502.07855
In decision-making systems, algorithmic recourse aims to identify minimal-cost actions that alter an individual's features, thereby obtaining a desired outcome. This empowers individuals to understand, question, or alter decisions that negatively affect them. However, due to the variety and sensitivity of system environments and individual personalities, quantifying cost with a single function is nearly impossible when multiple criteria must be considered. Most current recourse mechanisms use gradient-based methods that assume cost functions are differentiable, an assumption that often does not hold in real-world scenarios, resulting in sub-optimal solutions that compromise various criteria. These solutions are typically intractable and lack rigorous theoretical foundations, raising concerns regarding interpretability, reliability, and transparency from the explainable AI (XAI) perspective. To address these issues, this work proposes an algorithmic recourse framework that handles non-differentiable and discrete multi-cost functions. By formulating recourse as a multi-objective optimization problem and assigning weights to different criteria based on their importance, our method identifies Pareto-optimal recourse recommendations. To demonstrate scalability, we incorporate the concept of an epsilon-net, proving the ability to find approximated Pareto-optimal actions. Experiments show the trade-off between different criteria and the method's scalability on large graphs. Compared to current heuristic practices, our approach provides a stronger theoretical foundation and better aligns recourse suggestions with real-world requirements.
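As a toy illustration of the multi-objective view of recourse, the sketch below brute-forces the Pareto-optimal actions among a handful of hypothetical candidates with vector-valued costs; it omits the criteria weighting and the epsilon-net approximation the paper uses for scalability.

```python
def pareto_front(actions):
    """Return actions not dominated by any other candidate.

    `actions` is a list of (name, cost_vector) pairs; an action is dominated if
    some other action is no worse on every criterion and strictly better on at
    least one. Brute force is fine for small, discrete candidate sets.
    """
    front = []
    for name, cost in actions:
        dominated = any(
            all(o <= c for o, c in zip(other, cost))
            and any(o < c for o, c in zip(other, cost))
            for _, other in actions
        )
        if not dominated:
            front.append((name, cost))
    return front

# Hypothetical recourse candidates with (monetary cost, time cost, effort) tuples;
# the last one is dominated by "both" and gets filtered out.
candidates = [
    ("increase_income",      (3.0, 5.0, 2.0)),
    ("reduce_debt",          (2.0, 6.0, 1.0)),
    ("both",                 (4.0, 4.0, 3.0)),
    ("wait_one_year",        (0.5, 12.0, 0.5)),
    ("do_nothing_then_both", (4.5, 5.0, 3.0)),
]
print(pareto_front(candidates))
```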
https://arxiv.org/abs/2502.07214
Graph Neural Networks (GNNs) are vital for learning from graph-structured data, enabling applications in network analysis, recommendation systems, and speech analytics. Deploying them on edge devices like client PCs and laptops enhances real-time processing, privacy, and cloud independence. GNNs aid Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and enable event-based vision tasks. However, irregular memory access, sparsity, and dynamic structures cause high latency and energy overhead on resource-constrained devices. While modern edge processors integrate CPUs, GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular GNN computations. We introduce GraNNite, the first hardware-aware framework optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN accelerators via a structured three-step methodology: (1) enabling NPU execution, (2) optimizing performance, and (3) trading accuracy for efficiency gains. Step 1 employs GraphSplit for workload distribution and StaGr for static aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts performance using EffOp for control-heavy tasks and GraSp for sparsity exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce redundancy and memory transfers. Step 3 balances quality versus efficiency, where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to 8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher performance than CPUs and GPUs, respectively, across GNN models.
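As a rough illustration of why static aggregation maps well to NPUs, the sketch below expresses GCN-style neighborhood aggregation on a static graph as a single dense matmul with a precomputed normalized adjacency. This conveys the spirit of the static-aggregation step, but it is not GraNNite's implementation, which also handles dynamic graphs, sparsity, and quantization.

```python
import numpy as np

def static_aggregate(features, adjacency):
    """GCN-style aggregation for a static graph as one dense GEMM.

    The normalized adjacency (with self-loops) is built once offline, turning
    irregular per-node gather/scatter into the dense matrix multiply that
    data-parallel NPUs execute efficiently.
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])        # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm_adj = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return norm_adj @ features                            # single dense matmul

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
feats = np.arange(6, dtype=float).reshape(3, 2)
print(static_aggregate(feats, adj))
```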
https://arxiv.org/abs/2502.06921
Large Language Models (LLMs) have been integrated into recommendation systems to enhance user behavior comprehension. The Retrieval Augmented Generation (RAG) technique is further incorporated into these systems to retrieve more relevant items and improve system performance. However, existing RAG methods rely primarily on textual semantics and often fail to incorporate the most relevant items, limiting the effectiveness of the systems. In this paper, we propose Representation learning for retrieval-Augmented Large Language model Recommendation (RALLRec). Specifically, we enhance textual semantics by prompting LLMs to generate more detailed item descriptions, followed by joint representation learning of textual and collaborative semantics, which are extracted by the LLM and recommendation models, respectively. Considering the potential time-varying characteristics of user interest, a simple yet effective reranking method is further introduced to capture the dynamics of user preference. We conducted extensive experiments on three real-world datasets, and the evaluation results validated the effectiveness of our method. Code is made public at this https URL.
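A minimal sketch of the retrieve-then-rerank idea, assuming the textual (LLM) and collaborative (recommendation-model) embeddings are simply concatenated and a recency bonus reorders the retrieved items; the fusion and reranking rules here are illustrative stand-ins rather than RALLRec's learned joint representation.

```python
import numpy as np

def fuse(text_emb, cf_emb):
    """Concatenate and L2-normalize textual and collaborative embeddings."""
    v = np.concatenate([text_emb, cf_emb])
    return v / (np.linalg.norm(v) + 1e-12)

def retrieve_and_rerank(query, items, k=3, recency_weight=0.3):
    """Retrieve by fused-embedding similarity, then boost recently interacted items.

    `items` is a list of dicts with 'id', 'text_emb', 'cf_emb', and 'days_old'.
    """
    q = fuse(query["text_emb"], query["cf_emb"])
    scored = []
    for item in items:
        sim = float(q @ fuse(item["text_emb"], item["cf_emb"]))
        recency = np.exp(-item["days_old"] / 30.0)          # simple decay bonus
        score = (1 - recency_weight) * sim + recency_weight * recency
        scored.append((score, item["id"]))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]
```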
https://arxiv.org/abs/2502.06101
Reranking plays a crucial role in modern multi-stage recommender systems by rearranging the initial ranking list. Due to the inherent challenges of combinatorial search spaces, some current research adopts an evaluator-generator paradigm, with a generator producing feasible sequences and an evaluator selecting the best sequence based on the estimated list utility. However, these methods still face two issues. Firstly, due to the goal inconsistency between the evaluator and the generator, the generator tends to fit the local optimum of the exposure distribution rather than optimizing over the combinatorial space. Secondly, the strategy of generating target items one by one struggles to achieve optimality because it ignores information about subsequent items. To address these issues, we propose a Neighbor-Lists-based model for Generative Reranking (NLGR), which aims to improve the performance of the generator in the combinatorial space. NLGR follows the evaluator-generator paradigm and improves the generator's training and generation methods. Specifically, we use neighbor lists in the combinatorial space to enhance the training process, enabling the generator to perceive relative scores and find the optimization direction. Furthermore, we propose a novel sampling-based non-autoregressive generation method, which allows the generator to jump flexibly from the current list to any neighbor list. Extensive experiments on public and industrial datasets validate NLGR's effectiveness, and we have successfully deployed NLGR on the Meituan food delivery platform.
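To make the "neighbor list" notion tangible, here is a toy enumeration in which a neighbor of the current list is any list obtained by replacing one slot with an unused candidate; this reading of the term is an assumption for illustration, and the paper's generator samples among such neighbors guided by the evaluator rather than enumerating them.

```python
def neighbor_lists(current, candidate_pool):
    """Enumerate lists differing from `current` in exactly one position."""
    unused = [c for c in candidate_pool if c not in current]
    neighbors = []
    for i in range(len(current)):
        for c in unused:
            nxt = list(current)
            nxt[i] = c               # swap one slot for an unused candidate
            neighbors.append(nxt)
    return neighbors

print(neighbor_lists(["a", "b", "c"], ["a", "b", "c", "d", "e"]))
# 6 neighbors: each of the 3 positions replaced by "d" or "e"
```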
https://arxiv.org/abs/2502.06097
Large Vision-Language Models (VLMs) have achieved unprecedented success in several objective multimodal reasoning tasks. However, to further enhance their capabilities of empathetic and effective communication with humans, improving how VLMs process and understand emotions is crucial. Despite significant research attention on improving affective understanding, there is a lack of detailed evaluations of VLMs for emotion-related tasks, which can potentially help inform downstream fine-tuning efforts. In this work, we present the first comprehensive evaluation of VLMs for recognizing evoked emotions from images. We create a benchmark for the task of evoked emotion recognition and study the performance of VLMs for this task, from perspectives of correctness and robustness. Through several experiments, we demonstrate important factors that emotion recognition performance depends on, and also characterize the various errors made by VLMs in the process. Finally, we pinpoint potential causes for errors through a human evaluation study. We use our experimental results to inform recommendations for the future of emotion research in the context of VLMs.
https://arxiv.org/abs/2502.05660
Large language models (LLMs) are increasingly recognized for their exceptional generative capabilities and versatility across various tasks. However, the high inference costs associated with these models have not received adequate attention, particularly when compared to the focus on training costs in existing research. In response to this gap, our study conducts a comprehensive benchmarking of LLM inference energy across a wide range of NLP tasks, where we analyze the impact of different models, tasks, prompts, and system-related factors on inference energy. Specifically, our experiments reveal several interesting insights, including strong correlation of inference energy with output token length and response time. Also, we find that quantization and optimal batch sizes, along with targeted prompt phrases, can significantly reduce energy usage. This study is the first to thoroughly benchmark LLM inference across such a diverse range of aspects, providing insights and offering several recommendations for improving energy efficiency in model deployment.
https://arxiv.org/abs/2502.05610
Large Language Models (LLMs) have achieved immense success in revolutionizing various applications, including content generation, search and recommendation, and AI-assisted operation. To reduce high training costs, the Mixture-of-Experts (MoE) architecture has become a popular backbone for modern LLMs. However, despite these benefits, serving MoE-based LLMs suffers from severe memory inefficiency due to sparsely activated experts. Recent studies propose offloading inactive experts from GPU memory to CPU memory to improve the serving efficiency of MoE models. However, they either incur high inference latency or high model memory footprints due to coarse-grained designs. To tame the latency-memory trade-off in MoE serving, we present fMoE, a fine-grained expert offloading system for MoE serving that achieves low inference latency with memory efficiency. We design fMoE to extract fine-grained expert-selection patterns from MoE models and semantic hints from input prompts to efficiently guide expert prefetching, caching, and offloading decisions. fMoE is prototyped on top of HuggingFace Transformers and deployed on a six-GPU testbed. Experiments with open-source MoE models and real-world workloads show that fMoE reduces inference latency by 47% and improves the expert hit rate by 36% over state-of-the-art solutions.
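The sketch below shows a toy accelerator-resident expert cache with least-recently-used eviction, a simplified stand-in for the prefetching, caching, and offloading decisions fMoE makes; the real system additionally exploits fine-grained expert-selection patterns and prompt semantics, which are omitted here, and `load_fn`/`offload_fn` are hypothetical placeholders for host-device transfers.

```python
from collections import OrderedDict

class ExpertCache:
    """Keep at most `capacity` experts resident; evict the least recently used."""

    def __init__(self, capacity, load_fn, offload_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # e.g. copy expert weights CPU -> GPU
        self.offload_fn = offload_fn  # e.g. copy expert weights GPU -> CPU
        self.resident = OrderedDict() # expert_id -> weights on the accelerator

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)     # hit: refresh recency
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:      # miss: make room first
            victim, weights = self.resident.popitem(last=False)
            self.offload_fn(victim, weights)
        self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]

# Usage with trivial stand-ins for the transfer functions:
cache = ExpertCache(2, load_fn=lambda e: f"weights[{e}]", offload_fn=lambda e, w: None)
for expert in [0, 1, 0, 2, 3]:
    cache.get(expert)
print(list(cache.resident))  # the two most recently used experts remain resident
```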
https://arxiv.org/abs/2502.05370
Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs offer an efficient solution for continuous, automated evaluation. However, since the systems that are built and improved with these judgments are ultimately designed for human use, it is crucial that LLM judgments align closely with human evaluators to ensure such systems remain human-centered. On the other hand, aligning LLM judgments with human evaluators is challenging due to individual variability and biases in human judgments. We propose a simple yet effective framework to align LLM judgments with individual human evaluators or their aggregated judgments, without retraining or fine-tuning the LLM. Our approach learns a linear mapping between the LLM's outputs and human judgments, achieving over 142% average improvement in agreement across 29 tasks with only a small number of calibration examples used for training. Notably, our method works in zero-shot and few-shot settings, exceeds inter-human agreement on four out of six tasks, and enables smaller LLMs to achieve performance comparable to that of larger models.
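A minimal sketch of the core idea for scalar ratings: fit an affine map from LLM judge scores to a human rater's scores on a few calibration examples, then apply it to new judgments without touching the LLM. The data and the single-feature form are illustrative; the paper's mapping may operate on richer LLM outputs.

```python
import numpy as np

def fit_judge_calibration(llm_scores, human_scores):
    """Least-squares fit of  human ≈ a * llm + b  from calibration pairs."""
    X = np.column_stack([llm_scores, np.ones(len(llm_scores))])
    (a, b), *_ = np.linalg.lstsq(X, np.asarray(human_scores, dtype=float), rcond=None)
    return lambda s: a * np.asarray(s, dtype=float) + b

# Hypothetical calibration set where the LLM judge is systematically harsher.
llm_scores = [2.0, 3.0, 3.5, 4.0, 4.5]
human_scores = [3.1, 3.8, 4.2, 4.6, 4.9]
calibrate = fit_judge_calibration(llm_scores, human_scores)
print(calibrate([2.5, 4.2]))  # calibrated predictions for new judge scores
```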
https://arxiv.org/abs/2502.04997
We investigate algorithmic decision problems where agents can respond strategically to the decision maker's (DM) models. The demand for clear and actionable explanations from DMs to (potentially strategic) agents continues to rise. While prior work often treats explanations as full model disclosures, explanations in practice might convey only partial information, which can lead to misinterpretations and harmful responses. When full disclosure of the predictive model is neither feasible nor desirable, a key open question is how DMs can use explanations to maximise their utility without compromising agent welfare. In this work, we explore well-known local and global explanation methods, and establish a necessary condition to prevent explanations from misleading agents into self-harming actions. Moreover, with conditional homogeneity, we establish that action recommendation (AR)-based explanations are sufficient for non-harmful responses, akin to the revelation principle in information design. To operationalise AR-based explanations, we propose a simple algorithm to jointly optimise the predictive model and AR policy to balance DM outcomes with agent welfare. Our empirical results demonstrate the benefits of this approach as a more refined strategy for safe and effective partial model disclosure in algorithmic decision-making.
https://arxiv.org/abs/2502.04058
Retrieval-augmented generation (RAG) is a well-suited technique for retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a key module of the healthcare copilot, helping reduce misdiagnosis for healthcare practitioners and patients. However, the diagnostic accuracy and specificity of existing heuristic-based RAG models used in the medical domain are inadequate, particularly for diseases with similar manifestations. This paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited reasoning for the medical domain that retrieves diagnosis and treatment recommendations based on manifestations. MedRAG systematically constructs a comprehensive four-tier hierarchical diagnostic KG encompassing critical diagnostic differences of various diseases. These differences are dynamically integrated with similar EHRs retrieved from an EHR database, and reasoned within a large language model. This process enables more accurate and specific decision support, while also proactively providing follow-up questions to enhance personalized medical decision-making. MedRAG is evaluated on both a public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) collected from Tan Tock Seng Hospital, and its performance is compared against various existing RAG methods. Experimental results show that, leveraging the information integration and relational abilities of the KG, our MedRAG provides more specific diagnostic insights and outperforms state-of-the-art models in reducing misdiagnosis rates. Our code will be available at this https URL
https://arxiv.org/abs/2502.04413
Foundation models trained on patient electronic health records (EHRs) require tokenizing medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical codes from EHRs as isolated textual tokens. However, each medical code is defined by its textual description, its position in ontological hierarchies, and its relationships to other codes, such as disease co-occurrences and drug-treatment associations. Medical vocabularies contain more than 600,000 codes with critical information for clinical reasoning. We introduce MedTok, a multimodal medical code tokenizer that uses the text descriptions and relational context of codes. MedTok processes text using a language model encoder and encodes the relational structure with a graph encoder. It then quantizes both modalities into a unified token space, preserving modality-specific and cross-modality information. We integrate MedTok into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate using MedTok tokenizer with medical QA systems. Our results demonstrate the potential of MedTok as a unified tokenizer for medical codes, improving tokenization for medical foundation models.
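A highly simplified sketch of the tokenization step, assuming the text and graph embeddings are already produced by their respective encoders: fuse them and snap the result to the nearest entry of a shared codebook to obtain a discrete token ID. MedTok's actual quantizer is learned and preserves modality-specific and cross-modality information, which this nearest-neighbor stand-in does not.

```python
import numpy as np

def quantize_medical_code(text_emb, graph_emb, codebook):
    """Map a fused text+graph embedding to the index of its nearest codebook entry."""
    fused = np.concatenate([text_emb, graph_emb])
    distances = np.linalg.norm(codebook - fused, axis=1)
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 16))   # 1024 discrete tokens for dim-16 fused vectors
text_emb = rng.normal(size=8)            # e.g. from a language-model encoder
graph_emb = rng.normal(size=8)           # e.g. from a graph encoder
print(quantize_medical_code(text_emb, graph_emb, codebook))
```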
https://arxiv.org/abs/2502.04397
Knowledge Graph-based recommendations have gained significant attention due to their ability to leverage rich semantic relationships. However, constructing and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent advancements in Large Language Models (LLMs) offer a promising way to improve the quality and relevance of KGs for recommendation tasks. Despite this, integrating LLMs into KG-based systems presents challenges, such as efficiently augmenting KGs, addressing hallucinations, and developing effective joint learning methods. In this paper, we propose the Confidence-aware KG-based Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework that combines KGs and LLMs for the recommendation task. The framework includes: (1) an LLM-based subgraph augmenter for enriching KGs with high-quality information, (2) a confidence-aware message propagation mechanism to filter noisy triplets, and (3) a dual-view contrastive learning method to integrate user-item interactions and KG data. Additionally, we employ a confidence-aware explanation generation process to guide LLMs in producing realistic explanations for recommendations. Finally, extensive experiments demonstrate the effectiveness of CKG-LLMA across multiple public datasets.
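A one-layer sketch of what confidence-aware message propagation could look like: each neighboring entity's message is weighted by the confidence assigned to the connecting triplet, so triplets judged noisy contribute little. How confidences are produced and the exact propagation rule in CKG-LLMA are more involved; the equal self/neighborhood mixing here is an arbitrary illustrative choice.

```python
import numpy as np

def confidence_weighted_aggregate(entity_emb, neighbors):
    """Aggregate neighbor embeddings weighted by per-triplet confidence scores.

    `neighbors` is a list of (neighbor_embedding, confidence) pairs.
    """
    if not neighbors:
        return entity_emb
    embs = np.stack([emb for emb, _ in neighbors])
    conf = np.array([c for _, c in neighbors], dtype=float)
    weights = conf / (conf.sum() + 1e-12)          # noisy triplets get small weight
    message = (weights[:, None] * embs).sum(axis=0)
    return 0.5 * entity_emb + 0.5 * message        # mix self and neighborhood message

entity = np.ones(4)
nbrs = [(np.full(4, 2.0), 0.9), (np.full(4, -5.0), 0.05)]  # second triplet looks noisy
print(confidence_weighted_aggregate(entity, nbrs))
```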
https://arxiv.org/abs/2502.03715
This article describes how technical infrastructure developed by the nonprofit OpenMined enables external scrutiny of AI systems without compromising sensitive information. Independent external scrutiny of AI systems provides crucial transparency into AI development, so it should be an integral component of any approach to AI governance. In practice, external researchers have struggled to gain access to AI systems because of AI companies' legitimate concerns about security, privacy, and intellectual property. But now, privacy-enhancing technologies (PETs) have reached a new level of maturity: end-to-end technical infrastructure developed by OpenMined combines several PETs into various setups that enable privacy-preserving audits of AI systems. We showcase two case studies where this infrastructure has been deployed in real-world governance scenarios: "Understanding Social Media Recommendation Algorithms with the Christchurch Call" and "Evaluating Frontier Models with the UK AI Safety Institute." We describe types of scrutiny of AI systems that could be facilitated by current setups and OpenMined's proposed future setups. We conclude that these innovative approaches deserve further exploration and support from the AI governance community. Interested policymakers can focus on empowering researchers on a legal level.
https://arxiv.org/abs/2502.05219
This study investigates continual fine-tuning strategies for deep learning in online longitudinal electroencephalography (EEG) motor imagery (MI) decoding within a causal setting involving a large user group and multiple sessions per participant. We are the first to explore such strategies across a large user group, as longitudinal adaptation is typically studied in the single-subject setting with a single adaptation strategy, which limits the ability to generalize findings. First, we examine the impact of different fine-tuning approaches on decoder performance and stability. Building on this, we integrate online test-time adaptation (OTTA) to adapt the model during deployment, complementing the effects of prior fine-tuning. Our findings demonstrate that fine-tuning that successively builds on prior subject-specific information improves both performance and stability, while OTTA effectively adapts the model to evolving data distributions across consecutive sessions, enabling calibration-free operation. These results offer valuable insights and recommendations for future research in longitudinal online MI decoding and highlight the importance of combining domain adaptation strategies for improving BCI performance in real-world applications. Clinical Relevance: Our investigation enables more stable and efficient long-term motor imagery decoding, which is critical for neurorehabilitation and assistive technologies.
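As one concrete example of calibration-free online adaptation in MI decoding, the sketch below re-centers each incoming trial with a running mean covariance (online Euclidean alignment). This is a common OTTA-style component in the EEG literature and is offered under that assumption as an illustration, not as the specific adaptation scheme evaluated in the paper.

```python
import numpy as np

class OnlineEuclideanAlignment:
    """Whiten incoming EEG trials with a running estimate of the mean covariance."""

    def __init__(self, n_channels):
        self.mean_cov = np.eye(n_channels)
        self.count = 0

    def __call__(self, trial):
        # trial: (channels, samples) array for one motor-imagery trial.
        cov = trial @ trial.T / trial.shape[1]
        self.count += 1
        self.mean_cov += (cov - self.mean_cov) / self.count   # running average
        vals, vecs = np.linalg.eigh(self.mean_cov)
        inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
        return inv_sqrt @ trial                               # aligned trial

align = OnlineEuclideanAlignment(n_channels=8)
aligned = align(np.random.randn(8, 250))   # e.g. one 1-second trial at 250 Hz
print(aligned.shape)
```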
https://arxiv.org/abs/2502.06828
The field of fashion compatibility learning has attracted great attention from both the academic and industrial communities in recent years. Many studies have been carried out for fashion compatibility prediction, collocated outfit recommendation, artificial intelligence (AI)-enabled compatible fashion design, and related topics. In particular, AI-enabled compatible fashion design can be used to synthesize compatible fashion items or outfits in order to improve the design experience for designers or the efficacy of recommendations for customers. However, previous generative models for collocated fashion synthesis have generally focused on the image-to-image translation between fashion items of upper and lower clothing. In this paper, we propose a novel outfit generation framework, i.e., OutfitGAN, with the aim of synthesizing a set of complementary items to compose an entire outfit, given one extant fashion item and reference masks of target synthesized items. OutfitGAN includes a semantic alignment module, which is responsible for characterizing the mapping correspondence between the existing fashion items and the synthesized ones, to improve the quality of the synthesized images, and a collocation classification module, which is used to improve the compatibility of a synthesized outfit. In order to evaluate the performance of our proposed models, we built a large-scale dataset consisting of 20,000 fashion outfits. Extensive experimental results on this dataset show that our OutfitGAN can synthesize photo-realistic outfits and outperform state-of-the-art methods in terms of similarity, authenticity and compatibility measurements.
https://arxiv.org/abs/2502.06827
In a widely popular analogy by Turing Award Laureate Yann LeCun, machine intelligence has been compared to cake - where unsupervised learning forms the base, supervised learning adds the icing, and reinforcement learning is the cherry on top. We expand this 'cake that is intelligence' analogy from a simple structural metaphor to the full life-cycle of AI systems, extending it to sourcing of ingredients (data), conception of recipes (instructions), the baking process (training), and the tasting and selling of the cake (evaluation and distribution). Leveraging our re-conceptualization, we describe each step's entailed social ramifications and how they are bounded by statistical assumptions within machine learning. Whereas these technical foundations and social impacts are deeply intertwined, they are often studied in isolation, creating barriers that restrict meaningful participation. Our re-conceptualization paves the way to bridge this gap by mapping where technical foundations interact with social outcomes, highlighting opportunities for cross-disciplinary dialogue. Finally, we conclude with actionable recommendations at each stage of the metaphorical AI cake's life-cycle, empowering prospective AI practitioners, users, and researchers, with increased awareness and ability to engage in broader AI discourse.
https://arxiv.org/abs/2502.03038
Future link prediction is a fundamental challenge in various real-world dynamic systems. To address this, numerous temporal graph neural networks (temporal GNNs) and benchmark datasets have been developed. However, these datasets often feature excessive repeated edges and lack complex sequential dynamics, a key characteristic inherent in many real-world applications such as recommender systems and ``Who-To-Follow'' on social networks. This oversight has led existing methods to inadvertently downplay the importance of learning sequential dynamics, focusing primarily on predicting repeated edges. In this study, we demonstrate that existing methods, such as GraphMixer and DyGFormer, are inherently incapable of learning simple sequential dynamics, such as ``a user who has followed OpenAI and Anthropic is more likely to follow AI at Meta next.'' Motivated by this issue, we introduce the Temporal Graph Benchmark with Sequential Dynamics (TGB-Seq), a new benchmark carefully curated to minimize repeated edges, challenging models to learn sequential dynamics and generalize to unseen edges. TGB-Seq comprises large real-world datasets spanning diverse domains, including e-commerce interactions, movie ratings, business reviews, social networks, citation networks and web link networks. Benchmarking experiments reveal that current methods usually suffer significant performance degradation and incur substantial training costs on TGB-Seq, posing new challenges and opportunities for future research. TGB-Seq datasets, leaderboards, and example codes are available at this https URL.
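A small utility in the spirit of the dataset property TGB-Seq targets: measure what fraction of test interactions repeat a (source, destination) pair already seen during training. The function and data below are illustrative and not part of the TGB-Seq release.

```python
def repeated_edge_ratio(train_edges, test_edges):
    """Fraction of test edges whose (src, dst) pair already appears in training.

    Edges are (src, dst, timestamp) triples; high values indicate a benchmark
    that mostly rewards predicting repeated edges rather than sequential dynamics.
    """
    seen = {(s, d) for s, d, _ in train_edges}
    repeats = sum((s, d) in seen for s, d, _ in test_edges)
    return repeats / max(len(test_edges), 1)

train = [(1, 2, 10), (1, 3, 11), (2, 3, 12)]
test = [(1, 2, 20), (2, 4, 21)]          # one repeated edge, one unseen edge
print(repeated_edge_ratio(train, test))  # 0.5
```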
https://arxiv.org/abs/2502.02975