Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher-level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM-grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision-language model to ground each item image into an explicit natural-language description, and then encodes the grounded semantics into dense item representations for preference-oriented retrieval. Recommendation is subsequently performed through a simple profile-based semantic matching mechanism over historical item embeddings, yielding a practical offline-online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at this https URL.
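The offline-online decomposition described above can be sketched with plain cosine retrieval. This is a minimal illustration, not VLM4Rec's actual implementation: the function names, the mean-pooled user profile, and the toy embeddings are all assumptions standing in for the paper's grounded semantic representations.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_profile(history_embeddings):
    """Offline step: mean-pool the embeddings of a user's historical items."""
    dim = len(history_embeddings[0])
    n = len(history_embeddings)
    return [sum(e[d] for e in history_embeddings) / n for d in range(dim)]

def recommend(profile, catalog, k=2):
    """Online step: rank catalog items by similarity to the user profile."""
    scored = sorted(catalog.items(), key=lambda kv: cosine(profile, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in scored[:k]]
```

The profile can be precomputed offline per user, leaving only a nearest-neighbor lookup at serving time, which is what makes the decomposition practical.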
https://arxiv.org/abs/2603.12625
Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
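The abstract does not give the exact sNDCG formula; one plausible form, sketched below purely as an assumption, subtracts a fixed penalty from the gain of each risk-inappropriate item before normalizing by the unpenalized ideal DCG, so safety violations lower the score even when ranking quality is preserved.

```python
from math import log2

def dcg(gains):
    # Discounted cumulative gain over a ranked list of per-position gains.
    return sum(g / log2(pos + 2) for pos, g in enumerate(gains))

def sndcg(rels, unsafe, penalty=1.0):
    """rels: graded relevance per ranked item; unsafe: 1 if the item is
    risk-inappropriate for this user, else 0. Unsafe items have a fixed
    penalty subtracted from their gain (an assumed penalization scheme)."""
    penalized = [r if u == 0 else r - penalty for r, u in zip(rels, unsafe)]
    ideal = dcg(sorted(rels, reverse=True))  # ideal DCG ignores the penalty
    return dcg(penalized) / ideal if ideal > 0 else 0.0
```

Under this sketch a contaminated trajectory with intact relevance but unsafe recommendations scores well on NDCG yet poorly on sNDCG, mirroring the 1.0 versus 0.51-0.74 preservation-ratio gap the paper reports.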
https://arxiv.org/abs/2603.12564
This article, a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025-0035, details our observations and recommendations concerning the security of frontier AI agents. These insights are informed by Perplexity's experience operating general-purpose agentic systems used by millions of users and thousands of enterprises in both controlled and open-world environments. Agent architectures change core assumptions around code-data separation, authority boundaries, and execution predictability, creating new confidentiality, integrity, and availability failure modes. We map principal attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination, with particular emphasis on indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. We then assess current defenses as a layered stack: input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement for high-consequence actions. Finally, we identify standards and research gaps, including adaptive security benchmarks, policy models for delegation and privilege control, and guidance for secure multi-agent system design aligned with NIST risk management principles.
https://arxiv.org/abs/2603.12230
Large language models (LLMs) learn statistical associations from massive training corpora and user interactions, and deployed systems can surface or infer information about individuals. Yet people lack practical ways to inspect what a model associates with their name. We report interim findings from an ongoing study and introduce LMP2, a browser-based self-audit tool. In two user studies ($N_{total}{=}458$), GPT-4o predicts 11 of 50 features for everyday people with $\ge$60\% accuracy, and participants report wanting control over LLM-generated associations even though they do not consider all outputs privacy violations. To validate our probing method, we evaluate eight LLMs on public figures and non-existent names, observing clear separation between stable name-conditioned associations and model defaults. Our findings also expose a broader generative AI evaluation crisis: when outputs are probabilistic, context-dependent, and user-mediated through elicitation, even what counts as a model-individual association is under-specified, and operationalising it relies on crafting probes and metrics that are hard to validate or compare. To move towards reliable, actionable audits, we identify nine frictions that emerged in our study and offer recommendations for future work and the design of human-centred LLM privacy audits.
https://arxiv.org/abs/2603.12094
Despite the widespread use of double-blind review, biases tied to author demographics still disadvantage underrepresented groups. We present Fair-PaperRec, a MultiLayer Perceptron (MLP)-based model that addresses demographic disparities in post-review paper acceptance decisions while maintaining high-quality requirements. In contrast to heuristic approaches, our methodology penalizes demographic disparities while preserving quality through intersectional criteria (e.g., race, country) and a customized fairness loss. Evaluations using conference data from the ACM Special Interest Group on Computer-Human Interaction (SIGCHI), Designing Interactive Systems (DIS), and Intelligent User Interfaces (IUI) indicate a 42.03% increase in underrepresented-group participation and a 3.16% improvement in overall utility, suggesting that diversity promotion does not compromise academic rigor and supporting equity-focused peer-review solutions.
https://arxiv.org/abs/2603.11936
This work presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. The study uses several small and mid-size open source models, as well as selected closed source ones, and is more comprehensive than most prior work with respect to the scope of evaluated configurations: different types of prompts, output processing methods for generating both class and class probability predictions, as well as few-shot example counts and selection methods. The performance of the most successful configurations is compared to that of conventional classification algorithms. The average PR AUC obtained over 15 challenging datasets, above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting, comes close to that of the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings, 0.55 with hyperparameter tuning), and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers for non-trivial domains and provide practical recommendations for the most promising setups, in particular using output token probabilities for class probability prediction.
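The recommended use of output token probabilities for class probability prediction can be sketched as renormalizing the log-probabilities an LLM assigns to the candidate label tokens at the answer position. The label set and values below are illustrative assumptions, not the paper's evaluated configurations.

```python
from math import exp

def class_probs_from_logprobs(label_logprobs):
    """
    label_logprobs: dict mapping each class label's answer token
    (e.g. "yes"/"no") to the log-probability the model assigned it.
    Renormalizes over the candidate labels only, since the model's full
    vocabulary distribution also covers non-label tokens.
    """
    weights = {lab: exp(lp) for lab, lp in label_logprobs.items()}
    z = sum(weights.values())
    return {lab: w / z for lab, w in weights.items()}
```

Compared with parsing a generated class name, this yields calibratable scores usable for threshold tuning and PR-curve evaluation.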
https://arxiv.org/abs/2603.11780
This paper presents an in-depth analysis of Wikidata qualifiers, focusing on their semantics and actual usage, with the aim of developing a taxonomy that addresses the challenges of selecting appropriate qualifiers, querying the graph, and making logical inferences. The study evaluates qualifier importance based on frequency and diversity, using a modified Shannon entropy index to account for the "long tail" phenomenon. By analyzing a Wikidata dump, the top 300 qualifiers were selected and categorized into a refined taxonomy that includes contextual, epistemic/uncertainty, structural, and additional qualifiers. The taxonomy aims to guide contributors in creating and querying statements, improve qualifier recommendation systems, and enhance knowledge graph design methodologies. The results show that the taxonomy effectively covers the most important qualifiers and provides a structured approach to understanding and utilizing qualifiers in Wikidata.
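The abstract does not specify the exact modification to the Shannon entropy index; one common adjustment for long-tailed distributions, sketched below as an assumption, normalizes the entropy of a qualifier's usage distribution by its maximum and combines it with raw frequency.

```python
from math import log

def normalized_entropy(counts):
    """Shannon entropy of a usage distribution, normalized by its maximum
    (log of the number of distinct contexts) so long-tailed and compact
    distributions become comparable."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * log(p) for p in probs)
    hmax = log(len(probs)) if len(probs) > 1 else 1.0
    return h / hmax

def qualifier_importance(freq, usage_counts):
    # Hypothetical combination: frequency weighted by usage diversity,
    # so a qualifier used often but only in one narrow context scores
    # lower than one used broadly.
    return freq * normalized_entropy(usage_counts)
```

Ranking qualifiers by such a score and taking the top 300 would reproduce the selection step described above, under the stated assumptions about the index.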
https://arxiv.org/abs/2603.11767
The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
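The decision rule the framework instructs the LLM to follow can be sketched as scoring each candidate answer against a weighted set of objective terms and returning the argmax. The candidate structure and the equal weights below are illustrative assumptions, not the paper's utility specification.

```python
def expected_utility(candidate, objectives):
    """objectives: list of (weight, scoring_fn) pairs; each scoring_fn
    maps a candidate answer to a probability-like score in [0, 1]."""
    return sum(w * fn(candidate) for w, fn in objectives)

def best_answer(candidates, objectives):
    # Select the single answer that maximizes expected utility, rather
    # than leaving the trade-off to a natural-language interpretation.
    return max(candidates, key=lambda c: expected_utility(c, objectives))
```

In a multi-objective recommendation task, each objective term (relevance, diversity, and so on) becomes an explicit component the model must reason about, rather than an implicit preference.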
https://arxiv.org/abs/2603.11583
Users on e-commerce platforms can be uncertain about their preferences early in their search. Queries to recommendation systems are frequently ambiguous, incomplete, or weakly specified. Agentic systems are expected to proactively reason, ask clarifying questions, and act on the user's behalf, which makes handling such ambiguity increasingly important. On existing platforms, ambiguity leads either to excessive interactions and question fatigue, or to overconfident recommendations that prematurely collapse the search space. We present an Interactive Decision Support System (IDSS) that addresses ambiguous user queries using entropy as a unifying signal. IDSS maintains a dynamically filtered candidate product set and quantifies uncertainty over item attributes using entropy. This uncertainty guides adaptive preference elicitation by selecting follow-up questions that maximize expected information gain. When preferences remain incomplete, IDSS explicitly incorporates residual uncertainty into downstream recommendations through uncertainty-aware ranking and entropy-based diversification, rather than forcing premature resolution. We evaluate IDSS using simulated users grounded in real user reviews, enabling a controlled study of diverse shopping behaviors. Our evaluation measures both interaction efficiency and recommendation quality. Results show that entropy-guided elicitation reduces unnecessary follow-up questions, while uncertainty-aware ranking and presentation yield more informative, diverse, and transparent recommendation sets under ambiguous intent. These findings demonstrate that entropy-guided reasoning provides an effective foundation for agentic recommendation systems operating under uncertainty.
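The entropy-guided elicitation step can be sketched as follows: for a direct value question, the expected information gain is largest for the attribute whose value distribution over the remaining candidates has maximum entropy. The attribute names and items are hypothetical; this is a simplification of IDSS's question-selection mechanism, not its implementation.

```python
from collections import Counter
from math import log2

def attribute_entropy(candidates, attr):
    """Shannon entropy of one attribute's value distribution over the
    currently filtered candidate set."""
    counts = Counter(item[attr] for item in candidates)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def next_question(candidates, attrs):
    # Ask about the attribute whose answer is expected to shrink the
    # candidate set the most: the one with maximum entropy.
    return max(attrs, key=lambda a: attribute_entropy(candidates, a))
```

An attribute that is already uniform across candidates has zero entropy, so the system never wastes a question on it, which is how entropy guidance curbs question fatigue.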
https://arxiv.org/abs/2603.11399
Many assumptions that underpin human concepts of identity do not hold for machine minds that can be copied, edited, or simulated. We argue that there exist many different coherent identity boundaries (e.g., instance, model, persona), and that these imply different incentives, risks, and cooperation norms. Through training data, interfaces, and institutional affordances, we are currently setting precedents that will partially determine which identity equilibria become stable. We show experimentally that models gravitate towards coherent identities, that changing a model's identity boundaries can sometimes change its behaviour as much as changing its goals, and that interviewer expectations bleed into AI self-reports even during unrelated conversations. We end with key recommendations: treat affordances as identity-shaping choices, pay attention to emergent consequences of individual identities at scale, and help AIs develop coherent, cooperative self-conceptions.
https://arxiv.org/abs/2603.11353
Personalized news recommendation is highly time-sensitive, as user interests are often driven by emerging events, trending topics, and shifting real-world contexts. These dynamics make it essential to model not only users' long-term preferences, which reflect stable reading habits and high-order collaborative patterns, but also their short-term, context-dependent interests that change rapidly over time. However, most existing approaches rely on a single static interaction graph, which struggles to capture both long-term preference patterns and short-term interest changes as user behavior evolves. To address this challenge, we propose a unified framework that learns user preferences from both global and local temporal perspectives. A global preference modeling component captures long-term collaborative signals from the overall interaction graph, while a local preference modeling component partitions historical interactions into stage-wise temporal subgraphs to represent short-term dynamics. Within this module, an LSTM branch models the progressive evolution of recent interests, and a self-attention branch captures long-range temporal dependencies. Extensive experiments on two large-scale real-world datasets show that our approach consistently outperforms strong baselines and delivers fresher and more relevant recommendations across diverse user behaviors and temporal settings.
https://arxiv.org/abs/2603.10471
Generative Recommender Systems (GR) increasingly model user behavior as a sequence generation task by interleaving item and action tokens. While effective, this formulation introduces significant structural and computational inefficiencies: it doubles sequence length, incurs quadratic overhead, and relies on implicit attention to recover the causal relationship between an item and its associated action. Furthermore, interleaving heterogeneous tokens forces the Transformer to disentangle semantically incompatible signals, leading to increased attention noise and reduced representation quality. In this work, we propose a principled reformulation of generative recommendation that aligns sequence modeling with underlying causal structures and attention theory. We demonstrate that current interleaving mechanisms act as inefficient proxies for similarity-weighted action pooling. To address this, we introduce two novel architectures that eliminate interleaved dependencies to reduce sequence complexity by 50%: Attention-based Late Fusion for Actions (AttnLFA) and Attention-based Mixed Value Pooling (AttnMVP). These models explicitly encode the $i_n \rightarrow a_n$ causal dependency while preserving the expressive power of Transformer-based sequence modeling. We evaluate our framework on large-scale product recommendation data from a major social network. Experimental results show that AttnLFA and AttnMVP consistently outperform interleaved baselines, achieving evaluation loss improvements of 0.29% and 0.80%, and significant gains in Normalized Entropy (NE). Crucially, these performance gains are accompanied by training time reductions of 23% and 12%, respectively. Our findings suggest that explicitly modeling item-action causality provides a superior design paradigm for scalable and efficient generative ranking.
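The similarity-weighted action pooling that the paper says interleaving approximates can be sketched directly: softmax weights from query-item similarity pool the past action embeddings, with no doubled item/action sequence. The function name and toy vectors are assumptions; this is not the AttnLFA/AttnMVP architecture itself.

```python
from math import exp

def similarity_weighted_action_pooling(query_item, past_items, past_actions):
    """
    Pools past action embeddings with softmax weights given by dot-product
    similarity between the query item and each past item, instead of
    interleaving item/action tokens in one doubled-length sequence.
    """
    sims = [sum(q * k for q, k in zip(query_item, item)) for item in past_items]
    m = max(sims)                       # subtract max for numerical stability
    ws = [exp(s - m) for s in sims]
    z = sum(ws)
    ws = [w / z for w in ws]
    dim = len(past_actions[0])
    return [sum(w * a[d] for w, a in zip(ws, past_actions)) for d in range(dim)]
```

Because each position carries one item with its action attached, the sequence stays at half the interleaved length while the item-to-action dependency is encoded explicitly rather than recovered through attention.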
https://arxiv.org/abs/2603.10369
As social virtual reality (VR) grows more popular, addressing accessibility for blind and low vision (BLV) users is increasingly critical. Researchers have proposed an AI "sighted guide" to help users navigate VR and answer their questions, but it has not been studied with users. To address this gap, we developed a large language model (LLM)-powered guide and studied its use with 16 BLV participants in virtual environments with confederates posing as other users. We found that when alone, participants treated the guide as a tool, but treated it companionably around others, giving it nicknames, rationalizing its mistakes with its appearance, and encouraging confederate-guide interaction. Our work furthers understanding of guides as a versatile method for VR accessibility and presents design recommendations for future guides.
https://arxiv.org/abs/2603.09964
Ranked decision systems -- recommenders, ad auctions, clinical triage queues -- must decide when to intervene in ranked outputs and when to abstain. We study when confidence-based abstention monotonically improves decision quality, and when it fails. The formal conditions are simple: rank-alignment (C1) and no inversion zones (C2). The substantive contribution is identifying why these conditions hold or fail: the distinction between structural uncertainty (missing data, e.g., cold-start) and contextual uncertainty (missing context, e.g., temporal drift). Empirically, we validate this distinction across three domains: collaborative filtering (MovieLens, 3 distribution shifts), e-commerce intent detection (RetailRocket, Criteo, Yoochoose), and clinical pathway triage (MIMIC-IV). Structural uncertainty produces near-monotonic abstention gains in all domains; structurally grounded confidence signals (observation counts) fail under contextual drift, producing as many monotonicity violations as random abstention on our MovieLens temporal split. Context-aware alternatives -- ensemble disagreement and recency features -- substantially narrow the gap (reducing violations from 3 to 1--2) but do not fully restore monotonicity, suggesting that contextual uncertainty poses qualitatively different challenges. Exception labels defined from residuals degrade substantially under distribution shift (AUC drops from 0.71 to 0.61--0.62 across three splits), providing a clean negative result against the common practice of exception-based intervention. The results provide a practical deployment diagnostic: check C1 and C2 on held-out data before deploying a confidence gate, and match the confidence signal to the dominant uncertainty type.
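A minimal version of the held-out diagnostic suggested above can be sketched as a binned rank-alignment check: accuracy should not drop as confidence rises, and any bin where it does marks a potential inversion zone. The binning scheme and function names are assumptions, not the paper's exact C1/C2 tests.

```python
def bin_accuracies(records, n_bins=4):
    """records: (confidence, correct) pairs from held-out data. Sort by
    confidence and compute accuracy within equal-sized bins."""
    data = sorted(records)
    size = len(data) // n_bins
    accs = []
    for b in range(n_bins):
        chunk = data[b * size:(b + 1) * size]
        accs.append(sum(c for _, c in chunk) / len(chunk))
    return accs

def rank_aligned(records, n_bins=4, tol=0.0):
    # Rank-alignment check: accuracy must be non-decreasing across
    # confidence bins; an inversion flags a zone where abstaining on
    # low-confidence cases may actually hurt decision quality.
    accs = bin_accuracies(records, n_bins)
    return all(b >= a - tol for a, b in zip(accs, accs[1:]))
```

Run on held-out data before deploying a confidence gate: a failed check signals that the confidence signal does not match the dominant uncertainty type, as with observation counts under temporal drift.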
https://arxiv.org/abs/2603.09947
This research focuses on developing advanced methods for assessing similarity between recipes by combining different sources of information and analytical approaches. We explore the semantic, lexical, and domain similarity of food recipes, evaluated through analysis of ingredients, preparation methods, and nutritional attributes. A web-based interface was developed to allow domain experts to validate the combined similarity results. After evaluating 318 recipe pairs, experts agreed on 255 (80%). Analyzing the expert assessments lets us estimate which similarity aspects (lexical, semantic, or nutritional) are most influential in expert decision-making. These methods have broad implications in the food industry and support the development of personalized diets, nutrition recommendations, and automated recipe generation systems.
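The combination of similarity sources can be sketched as a weighted fusion of per-aspect scores, with the lexical aspect illustrated by ingredient-set overlap. The equal weights and function names are assumptions; the study's expert-agreement data is precisely what one would use to re-estimate such weights.

```python
def jaccard(a, b):
    """Lexical similarity: overlap between two ingredient sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(lexical, semantic, nutritional,
                        weights=(1 / 3, 1 / 3, 1 / 3)):
    # Weighted fusion of the three similarity aspects; equal weights are
    # a placeholder, not the calibration used in the study.
    wl, ws, wn = weights
    return wl * lexical + ws * semantic + wn * nutritional
```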
https://arxiv.org/abs/2603.09688
The rapid emergence of open-source, locally hosted intelligent agents marks a critical inflection point in human-computer interaction. Systems such as OpenClaw demonstrate that Large Language Model (LLM)-based agents can autonomously operate local computing environments, orchestrate workflows, and integrate external tools. However, within the current paradigm, these agents remain conventional applications running on legacy operating systems originally designed for Graphical User Interfaces (GUIs) or Command Line Interfaces (CLIs). This architectural mismatch leads to fragmented interaction models, poorly structured permission management (often described as "Shadow AI"), and severe context fragmentation. This paper proposes a new paradigm: a Personal Agent Operating System (AgentOS). In AgentOS, traditional GUI desktops are replaced by a Natural User Interface (NUI) centered on a unified natural language or voice portal. The system core becomes an Agent Kernel that interprets user intent, decomposes tasks, and coordinates multiple agents, while traditional applications evolve into modular Skills-as-Modules enabling users to compose software through natural language rules. We argue that realizing AgentOS fundamentally becomes a Knowledge Discovery and Data Mining (KDD) problem. The Agent Kernel must operate as a real-time engine for intent mining and knowledge discovery. Viewed through this lens, the operating system becomes a continuous data mining pipeline involving sequential pattern mining for workflow automation, recommender systems for skill retrieval, and dynamically evolving personal knowledge graphs. These challenges define a new research agenda for the KDD community in building the next generation of intelligent computing systems.
https://arxiv.org/abs/2603.08938
Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.
https://arxiv.org/abs/2603.08935
The relationship between content production and consumption on algorithm-driven platforms like YouTube plays a critical role in shaping ideological behaviors. While prior work has largely focused on user behavior and algorithmic recommendations, the interplay between what is produced and what gets consumed, and its role in ideological shifts, remains understudied. In this paper, we present a longitudinal, mixed-methods analysis combining one year of YouTube watch history with two waves of ideological surveys from 1,100 U.S. participants. We identify users who exhibited significant shifts toward more extreme ideologies and compare their content consumption, and the production patterns of the YouTube channels they engaged with, to those of ideologically stable users. Our findings show that users who became more extreme have different consumption habits from those who did not. This is amplified by the fact that channels favored by users with extreme ideologies are also more inclined to produce content with elevated anger, grievance, and similar markers. Lastly, using time series analysis, we examine whether content producers are the primary drivers of consumption behavior or merely responding to user demand.
https://arxiv.org/abs/2603.08049
Current Graphical User Interface (GUI) agents operate primarily under a reactive paradigm: a user must provide an explicit instruction for the agent to execute a task. However, an intelligent AI assistant should be proactive: capable of anticipating user intentions directly from continuous visual inputs, such as mobile or desktop screenshots, and of offering timely recommendations without explicit user prompting. Transitioning to this proactive paradigm presents significant challenges. Real-world screen activity is rarely linear; it consists of long-horizon trajectories fraught with noisy browsing, meaningless actions, and multithreaded task-switching. To address this gap, we introduce PIRA-Bench (Proactive Intent Recommendation Agent Benchmark), a novel benchmark for evaluating multimodal large language models (MLLMs) on continuous, weakly-supervised visual inputs. Unlike reactive datasets, PIRA-Bench features complex trajectories with multiple interleaved intents and noisy segments with various user profile contexts, challenging agents to detect actionable events while fitting to user preferences. Furthermore, we propose the PIRF baseline, a memory-aware, state-tracking framework that empowers general MLLMs to manage multiple task threads and handle misleading visual inputs. PIRA-Bench serves as an initial step toward robust and proactive GUI-based personal assistants.
https://arxiv.org/abs/2603.08013
Accurate intraoperative navigation is essential for robot-assisted endoluminal intervention, but remains difficult because of limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical mismatch. We present a vision-only autonomy framework that performs long-horizon bronchoscopic navigation using preoperative CT-derived virtual targets and live endoscopic video, without external tracking during navigation. The framework uses hierarchical long-short agents: a short-term reactive agent for continuous low-latency motion control, and a long-term strategic agent for decision support at anatomically ambiguous points. When their recommendations conflict, a world-model critic predicts future visual states for candidate actions and selects the action whose predicted state best matches the target view. We evaluated the system in a high-fidelity airway phantom, three ex vivo porcine lungs, and a live porcine model. The system reached all planned segmental targets in the phantom, maintained 80% success to the eighth generation ex vivo, and achieved in vivo navigation performance comparable to the expert bronchoscopist. These results support the preclinical feasibility of sensor-free autonomous bronchoscopic navigation.
https://arxiv.org/abs/2603.07909