Model merging (MM) offers an efficient mechanism for integrating multiple specialized models without access to the original training data or costly retraining. While MM has demonstrated success in domains like computer vision, its role in recommender systems (RSs) remains largely unexplored. Recently, Generative Recommendation (GR) has emerged as a new paradigm in RSs, characterized by rapidly growing model scales and substantial computational costs, making MM particularly appealing for cost-sensitive deployment scenarios. In this work, we present the first systematic study of MM in GR through a contextual lens. We focus on a fundamental yet underexplored real-world challenge: how to merge generative recommenders specialized to different real-world contexts, arising from temporally evolving user behaviors and heterogeneous application domains. To this end, we propose MMGRid, a unified framework built around a structured contextual grid of GR checkpoints that organizes models trained under diverse contexts induced by temporal evolution and domain diversity. All checkpoints are derived from a shared base LLM but fine-tuned on context-specific data, forming a realistic and controlled model space for systematically analyzing MM across GR paradigms and merging algorithms. Our investigation reveals several key insights. First, training GR models from LLMs can introduce parameter conflicts during merging due to token distribution shifts and objective disparities; such conflicts can be alleviated by disentangling task-aware and context-specific parameter changes via base-model replacement. Second, incremental training across contexts induces recency bias, which can be effectively balanced through weighted contextual merging. Notably, we observe that optimal merging weights correlate with context-dependent interaction characteristics, offering practical guidance for weight selection in real-world deployments.
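The weighted contextual merging described above can be pictured with a minimal task-arithmetic sketch. The dictionary-of-floats parameter layout, the `merge_checkpoints` name, and the specific weights below are illustrative assumptions, not the paper's implementation:

```python
def merge_checkpoints(base, checkpoints, weights):
    """Weighted contextual merging (illustrative sketch, not MMGRid itself).

    Each checkpoint's deviation from the shared base model (its "task vector")
    is scaled by a context weight, so recency bias can be balanced by
    down-weighting the most recent checkpoint.
    """
    assert len(checkpoints) == len(weights)
    merged = {}
    for name, base_param in base.items():
        # disentangle context-specific changes as deltas from the base model
        delta = sum(w * (ckpt[name] - base_param)
                    for ckpt, w in zip(checkpoints, weights))
        merged[name] = base_param + delta
    return merged

# Toy parameters: real checkpoints hold tensors, not single floats.
base = {"layer.w": 1.0}
older = {"layer.w": 2.0}   # hypothetical checkpoint from an earlier context
recent = {"layer.w": 4.0}  # hypothetical checkpoint from the latest context
merged = merge_checkpoints(base, [older, recent], weights=[0.5, 0.5])
```

Equal weights here average the two contexts; in practice the abstract suggests tying the weights to context-dependent interaction characteristics.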
https://arxiv.org/abs/2601.15930
Understanding what users like is relatively straightforward; understanding what users dislike, however, remains a challenging and underexplored problem. Research into users' negative preferences has gained increasing importance in modern recommendation systems. Numerous platforms have introduced explicit negative feedback mechanisms and leverage such signals to refine their recommendation models. Beyond traditional business metrics, user experience-driven metrics, such as negative feedback rates, have become critical indicators for evaluating system performance. However, most existing approaches primarily use negative feedback as an auxiliary signal to enhance positive recommendations, paying little attention to directly modeling negative interests, which can be highly valuable in offline applications. Moreover, due to the inherent sparsity of negative feedback data, models often suffer from context understanding biases induced by positive feedback dominance. To address these challenges, we propose the first large language model framework for negative feedback modeling with specially designed context-discerning modules. We use semantic ID representations to replace text-based item descriptions and introduce an item-level alignment task that enhances the LLM's understanding of the semantic context behind negative feedback. Furthermore, we design a Progressive GRPO training paradigm that enables the model to dynamically balance positive and negative behavioral context utilization. Finally, our investigation reveals a fundamental misalignment between the conventional next-negative-item prediction objective and users' true negative preferences, as the former is heavily influenced by the system's recommendation order. To mitigate this, we propose a novel reward function and evaluation metric grounded in multi-day future negative feedback and their collaborative signals.
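The multi-day future-negative reward is described only at a high level; one hedged reading, with invented set names and an invented partial-credit value, is:

```python
def negative_feedback_reward(predicted_item, future_negatives, collaborative_neighbors):
    """Reward grounded in multi-day future negative feedback (illustrative sketch).

    Instead of rewarding only the next negative item (which depends on the
    system's recommendation order), credit any item the user rejects over a
    multi-day horizon, with partial credit for collaboratively similar items.
    The 0.5 partial-credit value is an assumption, not the paper's choice.
    """
    if predicted_item in future_negatives:
        return 1.0
    if predicted_item in collaborative_neighbors:
        return 0.5
    return 0.0

# future_negatives aggregates disliked items over, e.g., the next several days;
# collaborative_neighbors holds items with signals similar to those negatives.
future_negatives = {"item_12", "item_40"}
collaborative_neighbors = {"item_41"}
```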
https://arxiv.org/abs/2601.15721
Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes for all attributes, including an NA class indicating that an attribute is not applicable; (2) attribute applicability detection; and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning the attribute does not exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings, over 5,000 images and 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; and (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
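The tier decomposition is mechanical enough to sketch in plain Python. The labels and the tiny example below are invented, and the paper's exact averaging choices may differ:

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(gold, pred, classes):
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)

def three_tier_eval(gold, pred, classes):
    # Tier 1: overall macro-F1 over all classes, NA included
    tier1 = macro_f1(gold, pred, classes)
    # Tier 2: F1 on detecting that an attribute is not applicable
    tier2 = macro_f1(gold, pred, ["NA"])
    # Tier 3: fine-grained macro-F1 restricted to examples where the attribute applies
    applicable = [(g, p) for g, p in zip(gold, pred) if g != "NA"]
    gold3 = [g for g, _ in applicable]
    pred3 = [p for _, p in applicable]
    tier3 = macro_f1(gold3, pred3, [c for c in classes if c != "NA"])
    return tier1, tier2, tier3
```

A model can score well on Tier 3 while failing Tier 2, which is exactly the bottleneck pattern the benchmark reports for VLMs.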
https://arxiv.org/abs/2601.15711
Contemporary sequential recommendation methods are becoming more complex, shifting from classification to a diffusion-guided generative paradigm. However, the quality of guidance in the form of user information is often compromised by missing data in the observed sequences, leading to suboptimal generation quality. Existing methods address this by removing locally similar items, but overlook "critical turning points" in user interest, which are crucial for accurately predicting subsequent user intent. To address this, we propose a novel Counterfactual Attention Regulation Diffusion model (CARD), which focuses on amplifying the signal from key interest-turning-point items while concurrently identifying and suppressing noise within the user sequence. CARD consists of (1) a Dual-side Thompson Sampling method to identify sequences undergoing significant interest shift, and (2) a counterfactual attention mechanism for these sequences to quantify the importance of each item. In this manner, CARD provides the diffusion model with a high-quality guidance signal composed of dynamically re-weighted interaction vectors to enable effective generation. Experiments show our method works well on real-world data without being computationally expensive. Our code is available at this https URL.
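The "dual-side" sampling idea can be illustrated with a stdlib sketch that keeps one Beta posterior per side of the sequence; the (positives, negatives) count layout and the 0.3 threshold are assumptions, not CARD's exact formulation:

```python
import random

def detect_interest_shift(recent, historical, threshold=0.3, rng=random):
    """Dual-side Thompson Sampling sketch (illustrative; not CARD's exact form).

    `recent` and `historical` are (positives, negatives) engagement counts.
    One Beta posterior is sampled per side; a large gap between the two
    samples suggests the user's interest has turned.
    """
    r_pos, r_neg = recent
    h_pos, h_neg = historical
    r_sample = rng.betavariate(r_pos + 1, r_neg + 1)
    h_sample = rng.betavariate(h_pos + 1, h_neg + 1)
    return abs(r_sample - h_sample) > threshold
```

A sequence whose recent window looks very different from its history, say (9, 1) against (1, 9), should be flagged far more often than a stationary one.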
https://arxiv.org/abs/2601.15673
PURPOSE OR GOAL: This study investigates how GenAI can be integrated with a criterion-referenced grading framework to improve the efficiency and quality of grading for mathematical assessments in engineering. It specifically explores the challenges demonstrators face with manual, model solution-based grading and how a GenAI-supported system can be designed to reliably identify student errors, provide high-quality feedback, and support human graders. The research also examines human graders' perceptions of the effectiveness of this GenAI-assisted approach. ACTUAL OR ANTICIPATED OUTCOMES: The study found that GenAI achieved an overall grading accuracy of 92.5%, comparable to two experienced human graders. The two researchers, who also served as subject demonstrators, perceived the GenAI as a helpful second reviewer that improved accuracy by catching small errors and provided more complete feedback than they could manually. A central outcome was the significant enhancement of formative feedback. However, they noted the GenAI tool is not yet reliable enough for autonomous use, especially with unconventional solutions. CONCLUSIONS/RECOMMENDATIONS/SUMMARY: This study demonstrates that GenAI, when paired with a structured, criterion-referenced framework using binary questions, can grade engineering mathematical assessments with an accuracy comparable to human experts. Its primary contribution is a novel methodological approach that embeds the generation of high-quality, scalable formative feedback directly into the assessment workflow. Future work should investigate student perceptions of GenAI grading and feedback.
https://arxiv.org/abs/2601.15626
Prompting is central to interaction with AI systems, yet many users struggle to explore alternative directions, articulate creative intent, or understand how variations in prompts shape model outputs. We introduce prompt recommender systems (PRS) as an interaction approach that supports exploration by suggesting contextually relevant follow-up prompts. We present PromptHelper, a PRS prototype integrated into an AI chatbot that surfaces semantically diverse prompt suggestions while users work on real writing tasks. We evaluate PromptHelper in a 2x2 fully within-subjects study (N=32) across creative and academic writing tasks. Results show that PromptHelper significantly increases users' perceived exploration and expressiveness without increasing cognitive workload. Qualitative findings illustrate how prompt recommendations help users branch into new directions, overcome uncertainty about what to ask next, and better articulate their intent. We discuss implications for designing AI interfaces that scaffold exploratory interaction while preserving user agency, and release open-source resources to support research on prompt recommendation.
https://arxiv.org/abs/2601.15575
We present BanditLP, a scalable multi-stakeholder contextual bandit framework that unifies neural Thompson Sampling for learning objective-specific outcomes with a large-scale linear program for constrained action selection at serving time. The methodology is application-agnostic, compatible with arbitrary neural architectures, and deployable at web scale, with an LP solver capable of handling billions of variables. Experiments on public benchmarks and synthetic data show consistent gains over strong baselines. We apply this approach in LinkedIn's email marketing system and demonstrate business wins, illustrating the value of integrated exploration and constrained optimization in production.
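BanditLP's two stages can be caricatured in a few lines. The real system learns outcomes with neural Thompson Sampling and solves a billion-variable LP at serving time; this stdlib sketch substitutes Beta-Bernoulli posteriors and a greedy knapsack as a stand-in for the LP, so it is a toy, not the production algorithm:

```python
import random

def sample_scores(arms, rng=random):
    # arms: {action: (successes, failures)} -- Beta-Bernoulli Thompson Sampling
    return {a: rng.betavariate(s + 1, f + 1) for a, (s, f) in arms.items()}

def constrained_select(scores, costs, budget):
    # Stand-in for the serving-time LP: greedy knapsack on sampled scores.
    chosen, spent = [], 0.0
    for action in sorted(scores, key=lambda a: scores[a] / costs[a], reverse=True):
        if spent + costs[action] <= budget:
            chosen.append(action)
            spent += costs[action]
    return chosen
```

With sampled scores of 0.9, 0.5, and 0.1 for three unit-cost actions and a budget of two, the selection stage keeps the two highest-value actions; the exploration comes entirely from the posterior sampling step.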
https://arxiv.org/abs/2601.15552
Personalized learning systems have emerged as a promising approach to enhance student outcomes by tailoring educational content, pacing, and feedback to individual needs. However, most existing systems remain fragmented, specializing in either knowledge tracing, diagnostic modeling, or resource recommendation, but rarely integrating these components into a cohesive adaptive cycle. In this paper, we propose ALIGNAgent (Adaptive Learner Intelligence for Gap Identification and Next-step guidance), a multi-agent educational framework designed to deliver personalized learning through integrated knowledge estimation, skill-gap identification, and targeted resource recommendation. ALIGNAgent begins by processing student quiz performance, gradebook data, and learner preferences to generate topic-level proficiency estimates using a Skill Gap Agent that employs concept-level diagnostic reasoning to identify specific misconceptions and knowledge deficiencies. After identifying skill gaps, the Recommender Agent retrieves preference-aware learning materials aligned with diagnosed deficiencies, implementing a continuous feedback loop where interventions occur before advancing to subsequent topics. Extensive empirical evaluation on authentic datasets from two undergraduate computer science courses demonstrates ALIGNAgent's effectiveness, with GPT-4o-based agents achieving precision of 0.87-0.90 and F1 scores of 0.84-0.87 in knowledge proficiency estimation validated against actual exam performance.
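The proficiency-to-gap-to-recommendation loop can be sketched with toy data. The mastery threshold, catalog layout, and function names are invented stand-ins for what ALIGNAgent's LLM agents do with diagnostic reasoning:

```python
def estimate_proficiency(quiz_results):
    """Topic-level proficiency as the fraction of correct answers
    (a toy stand-in for ALIGNAgent's LLM-based estimation)."""
    return {topic: sum(answers) / len(answers)
            for topic, answers in quiz_results.items()}

def identify_gaps(proficiency, threshold=0.6):
    # The 0.6 mastery threshold is an illustrative assumption.
    return sorted(t for t, p in proficiency.items() if p < threshold)

def recommend_resources(gaps, catalog, preferred_formats):
    # catalog: {topic: {format: resource}}; honor learner format preferences
    recs = {}
    for topic in gaps:
        formats = catalog.get(topic, {})
        fmt = next((f for f in preferred_formats if f in formats), None)
        if fmt is not None:
            recs[topic] = formats[fmt]
    return recs
```

The feedback loop then re-runs estimation after the learner consumes the recommended material, intervening before the next topic is unlocked.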
https://arxiv.org/abs/2601.15551
Peer review is at the heart of modern science. As submission numbers rise and research communities grow, the decline in review quality is a popular narrative and a common concern. Yet, is it true? Review quality is difficult to measure, and the ongoing evolution of reviewing practices makes it hard to compare reviews across venues and time. To address this, we introduce a new framework for evidence-based comparative study of review quality and apply it to major AI and machine learning conferences: ICLR, NeurIPS and *ACL. We document the diversity of review formats and introduce a new approach to review standardization. We propose a multi-dimensional schema for quantifying review quality as utility to editors and authors, coupled with both LLM-based and lightweight measurements. We study the relationships between measurements of review quality, and its evolution over time. Contradicting the popular narrative, our cross-temporal analysis reveals no consistent decline in median review quality across venues and years. We propose alternative explanations, and outline recommendations to facilitate future empirical studies of review quality.
https://arxiv.org/abs/2601.15172
User interactions on e-commerce platforms are inherently diverse, involving behaviors such as clicking, favoriting, adding to cart, and purchasing. The transitions between these behaviors offer valuable insights into user-item interactions, serving as a key signal for understanding evolving preferences. Consequently, there is growing interest in leveraging multi-behavior data to better capture user intent. Recent studies have explored sequential modeling of multi-behavior data, many relying on transformer-based architectures with polynomial time complexity. While effective, these approaches often incur high computational costs, limiting their applicability in large-scale industrial systems with long user sequences. To address this challenge, we propose the Transition-Aware Graph Attention Network (TGA), a linear-complexity approach for modeling multi-behavior transitions. Unlike traditional transformers that treat all behavior pairs equally, TGA constructs a structured sparse graph by identifying informative transitions from three perspectives: (a) item-level transitions, (b) category-level transitions, and (c) neighbor-level transitions. Built upon the structured graph, TGA employs a transition-aware graph attention mechanism that jointly models user-item interactions and behavior transition types, enabling more accurate capture of sequential patterns while maintaining computational efficiency. Experiments show that TGA outperforms all state-of-the-art models while significantly reducing computational cost. Notably, TGA has been deployed in a large-scale industrial production environment, where it leads to impressive improvements in key business metrics.
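The three edge types can be sketched over a toy behavior sequence. The window size and tuple layout are assumptions; TGA's actual graph construction and attention are more involved:

```python
def build_transition_edges(seq, category, window=2):
    """Build TGA-style sparse edges from a multi-behavior sequence (sketch).

    seq: list of (item, behavior) pairs in chronological order.
    category: item -> category lookup.
    Edges connect positions i < j within a small window; the real model also
    conditions attention on the behavior-transition type (behavior_i, behavior_j).
    """
    item_edges, cat_edges, nbr_edges = set(), set(), set()
    for i, (item_i, _) in enumerate(seq):
        for j in range(i + 1, min(i + 1 + window, len(seq))):
            item_j, _ = seq[j]
            nbr_edges.add((i, j))              # neighbor-level transition
            if item_i == item_j:
                item_edges.add((i, j))         # item-level, e.g. click -> purchase
            elif category[item_i] == category[item_j]:
                cat_edges.add((i, j))          # category-level transition
    return item_edges, cat_edges, nbr_edges
```

Because each position only attends within its window and edge sets, the number of edges grows linearly with sequence length instead of quadratically as in a full transformer.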
https://arxiv.org/abs/2601.14955
Semantic ID learning is a key interface in Generative Recommendation (GR) models, mapping items to discrete identifiers grounded in side information, most commonly via a pretrained text encoder. However, these text encoders are primarily optimized for well-formed natural language. In real-world recommendation data, item descriptions are often symbolic and attribute-centric, containing numerals, units, and abbreviations. These text encoders can break these signals into fragmented tokens, weakening semantic coherence and distorting relationships among attributes. Worse still, when moving to multimodal GR, relying on standard text encoders introduces an additional obstacle: text and image embeddings often exhibit mismatched geometric structures, making cross-modal fusion less effective and less stable. In this paper, we revisit representation design for Semantic ID learning by treating text as a visual signal. We conduct a systematic empirical study of OCR-based text representations, obtained by rendering item descriptions into images and encoding them with vision-based OCR models. Experiments across four datasets and two generative backbones show that OCR-text consistently matches or surpasses standard text embeddings for Semantic ID learning in both unimodal and multimodal settings. Furthermore, we find that OCR-based Semantic IDs remain robust under extreme spatial-resolution compression, indicating strong robustness and efficiency in practical deployments.
https://arxiv.org/abs/2601.14697
Learning Path Recommendation (LPR) aims to generate personalized sequences of learning items that maximize long-term learning effect while respecting pedagogical principles and operational constraints. Although large language models (LLMs) offer rich semantic understanding for free-form recommendation, applying them to long-horizon LPR is challenging due to (i) misalignment with pedagogical objectives such as the Zone of Proximal Development (ZPD) under sparse, delayed feedback, (ii) scarce and costly expert demonstrations, and (iii) multi-objective interactions among learning effect, difficulty scheduling, length controllability, and trajectory diversity. To address these issues, we propose IB-GRPO (Indicator-Based Group Relative Policy Optimization), an indicator-guided alignment approach for LLM-based LPR. To mitigate data scarcity, we construct hybrid expert demonstrations via Genetic Algorithm search and teacher RL agents and warm-start the LLM with supervised fine-tuning. Building on this warm-start, we design a within-session ZPD alignment score for difficulty scheduling. IB-GRPO then uses the $I_{\epsilon+}$ dominance indicator to compute group-relative advantages over multiple objectives, avoiding manual scalarization and improving Pareto trade-offs. Experiments on ASSIST09 and Junyi using the KES simulator with a Qwen2.5-7B backbone show consistent improvements over representative RL and LLM baselines.
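The $I_{\epsilon+}$ dominance indicator has a standard closed form for maximization. The sketch below pairs it with an IBEA-style fitness and GRPO-style within-group standardization; the kappa value and the exact aggregation are assumptions about details the abstract leaves out:

```python
import math

def eps_indicator(a, b):
    """Additive epsilon indicator I_eps+(a, b) for maximization: the smallest
    uniform shift that makes objective vector `a` weakly dominate `b`."""
    return max(b_k - a_k for a_k, b_k in zip(a, b))

def indicator_fitness(objectives, kappa=0.05):
    # IBEA-style fitness; kappa=0.05 is an illustrative scaling assumption.
    fits = []
    for i, a in enumerate(objectives):
        f = sum(-math.exp(-eps_indicator(b, a) / kappa)
                for j, b in enumerate(objectives) if j != i)
        fits.append(f)
    return fits

def group_relative_advantages(fits):
    # GRPO-style standardization of fitness within the sampled group.
    mean = sum(fits) / len(fits)
    std = (sum((f - mean) ** 2 for f in fits) / len(fits)) ** 0.5 or 1.0
    return [(f - mean) / std for f in fits]
```

Dominated candidates accumulate large negative fitness, so the standardized advantages rank Pareto-better learning paths above worse ones without any manual scalarization of the objectives.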
https://arxiv.org/abs/2601.14686
This report distills the discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation (EDA), held on December 10, 2024 in Vancouver alongside NeurIPS 2024. Bringing together experts across machine learning and EDA, the workshop examined how AI, spanning large language models (LLMs), graph neural networks (GNNs), reinforcement learning (RL), neurosymbolic methods, and more, can facilitate EDA and shorten design turnaround. The workshop covered four themes: (1) AI for physical synthesis and design for manufacturing (DFM), discussing challenges in the physical manufacturing process and potential AI applications; (2) AI for high-level and logic-level synthesis (HLS/LLS), covering pragma insertion, program transformation, RTL code generation, and more; (3) an AI toolbox for optimization and design, discussing frontier AI developments that could potentially be applied to EDA tasks; and (4) AI for test and verification, including LLM-assisted verification tools, ML-augmented SAT solving, and security/reliability challenges. The report recommends that NSF foster AI/EDA collaboration, invest in foundational AI for EDA, develop robust data infrastructures, promote scalable compute infrastructure, and invest in workforce development to democratize hardware design and enable next-generation hardware systems. The workshop information can be found on the website this https URL.
https://arxiv.org/abs/2601.14541
Postoperative complications remain a critical concern in clinical practice, adversely affecting patient outcomes and contributing to rising healthcare costs. We present MIRACLE, a deep learning architecture for predicting the risk of postoperative complications in lung cancer surgery by integrating preoperative clinical and radiological data. MIRACLE employs hyperspherical embedding-space fusion of heterogeneous inputs, enabling the extraction of robust, discriminative features from both structured clinical records and high-dimensional radiological images. To enhance prediction transparency and clinical utility, we incorporate an interventional deep learning module in MIRACLE that not only refines predictions but also provides interpretable and actionable insights, allowing domain experts to interactively adjust recommendations based on clinical expertise. We validate our approach on POC-L, a real-world dataset comprising 3,094 lung cancer patients who underwent surgery at Roswell Park Comprehensive Cancer Center. Our results demonstrate that MIRACLE outperforms various traditional machine learning models and contemporary large language model (LLM) variants used alone, enabling personalized and explainable postoperative risk management.
https://arxiv.org/abs/2601.14154
Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to interact with external APIs, greatly enhancing their practical utility. While prior research has improved tool-calling performance by adopting traditional computer systems techniques, such as parallel and asynchronous execution, the challenge of redundant or repeated tool-calling requests remains largely unaddressed. Caching is a classic solution to this problem, but applying it to LLM tool-calling introduces new difficulties due to heterogeneous request semantics, dynamic workloads, and varying freshness requirements, which render conventional cache policies ineffective. To address these issues, we propose ToolCaching, an efficient feature-driven and adaptive caching framework for LLM tool-calling systems. ToolCaching systematically integrates semantic and system-level features to evaluate request cacheability and estimate caching value. At its core, the VAAC algorithm integrates bandit-based admission with value-driven, multi-factor eviction, jointly accounting for request frequency, recency, and caching value. Extensive experiments on synthetic and public tool-calling workloads demonstrate that ToolCaching with VAAC achieves up to 11% higher cache hit ratios and 34% lower latency compared to standard policies, effectively accelerating LLM tool-calling in practical applications.
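VAAC's exact admission and eviction rules aren't given in the abstract; a toy cache that admits only requests flagged cacheable and evicts by a frequency-times-recency value (a stand-in for the multi-factor caching value) might look like:

```python
class ToolCache:
    """Feature-driven tool-call cache (illustrative sketch, not VAAC itself)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}  # key -> (result, frequency, last_access_tick)
        self.tick = 0

    def _value(self, key):
        # Stand-in caching value: access frequency discounted by staleness.
        _, freq, last = self.store[key]
        return freq / (1 + self.tick - last)

    def get(self, key):
        self.tick += 1
        if key in self.store:
            result, freq, _ = self.store[key]
            self.store[key] = (result, freq + 1, self.tick)
            return result
        return None

    def put(self, key, result, cacheable=True):
        self.tick += 1
        if not cacheable:  # admission: skip fresh-only or non-deterministic calls
            return
        if key not in self.store and len(self.store) >= self.capacity:
            victim = min(self.store, key=self._value)  # value-driven eviction
            del self.store[victim]
        self.store[key] = (result, 1, self.tick)
```

A frequently re-read weather lookup survives eviction while a one-off currency lookup does not, and a `now()`-style call is never admitted at all; the real system additionally learns admission with a bandit over semantic and system-level features.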
https://arxiv.org/abs/2601.15335
Artificial intelligence has reshaped medical imaging, yet the use of AI on clinical data for prospective decision support remains limited. We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a more than 8.9-point reduction in SNOT-22 at 6 months (the MCID). In a prospectively collected cohort where all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would have poor outcomes, i.e., those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence. Our best ML model (MLP) achieves 85% accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration in the zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP's feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and psychological/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI-augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.
https://arxiv.org/abs/2601.13710
Reinforcement learning plays a crucial role in generative re-ranking scenarios due to its exploration-exploitation capabilities, but existing generative methods mostly fail to adapt to the dynamic entropy changes that accompany varying decision difficulty during list generation, making it challenging to accurately capture complex preferences. Given that language models have achieved remarkable breakthroughs by integrating reasoning capabilities, we draw on this approach to introduce a latent reasoning mechanism; experimental validation demonstrates that this mechanism effectively reduces entropy in the model's decision-making process. Based on these findings, we introduce the Entropy-Guided Latent Reasoning (EGLR) recommendation model, which has three core advantages. First, it abandons the "reason first, recommend later" paradigm in favor of "reasoning while recommending", designed for the high-difficulty nature of list generation by enabling real-time reasoning during generation. Second, it implements entropy-guided variable-length reasoning using context-aware reasoning tokens alongside dynamic temperature adjustment, expanding exploration breadth during reasoning and boosting exploitation precision during recommendation to achieve a more precisely adapted exploration-exploitation trade-off. Third, the model adopts a lightweight integration design with no complex independent modules or post-processing, enabling easy adaptation to existing models. Experimental results on two real-world datasets validate the model's effectiveness, and a notable advantage is its compatibility with existing generative re-ranking models, whose performance it can enhance. Further analyses also demonstrate its practical deployment value and research potential.
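The entropy-guided dynamic temperature adjustment can be sketched in miniature. The mapping below (higher decision entropy lowers the sampling temperature toward exploitation, lower entropy raises it toward exploration) is an illustrative assumption rather than EGLR's actual schedule; only the entropy-drives-temperature idea comes from the abstract:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_guided_temperature(logits, t_min=0.5, t_max=1.5):
    """Toy rule: normalize the decision entropy by its maximum (uniform
    distribution), then interpolate the temperature between t_max (confident
    step, explore more) and t_min (uncertain step, sharpen and exploit).
    The direction of this mapping is a hypothetical choice for illustration."""
    h = entropy(softmax(logits))
    h_max = math.log(len(logits))   # entropy of the uniform distribution
    frac = h / h_max                # 0 = confident, 1 = maximally uncertain
    return t_max - frac * (t_max - t_min)

logits = [2.0, 1.0, 0.2, 0.1]
t = entropy_guided_temperature(logits)
print(t, softmax(logits, temperature=t))
```

Per-step adjustment like this is what lets the trade-off track the "dynamic entropy changes" during list generation instead of using one global temperature.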
https://arxiv.org/abs/2601.13533
Customer reviews contain detailed, domain-specific signals about service failures and user expectations, but converting this unstructured feedback into actionable business decisions remains difficult. We study review-to-action generation: producing concrete, implementable recommendations grounded in review text. We propose a modular two-LLM framework in which an Issue model extracts salient issues and assigns coarse themes, and an Advice model generates targeted operational fixes conditioned on the extracted issue representation. To enable specialization without expensive full fine-tuning, we adapt the Advice model using a mixture-of-LoRA-experts strategy: multiple low-rank adapters are trained, and a lightweight gating mechanism performs token-level expert mixing at inference, combining complementary expertise across issue types. We construct synthetic review-issue-advice triples from Yelp reviews (airlines and restaurants) to supervise training, and evaluate recommendations using an eight-dimension operational rubric spanning actionability, specificity, feasibility, expected impact, novelty, non-redundancy, bias, and clarity. Across both domains, our approach consistently outperforms prompting-only and single-adapter baselines, yielding higher actionability and specificity while retaining favorable efficiency-quality trade-offs.
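A toy sketch of the token-level mixture of LoRA experts described above, assuming the standard LoRA parametrization (frozen weight W plus low-rank update B·A per expert) and a linear gating head over the token's hidden state; dimensions, initialization, and the gating form are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 4, 3    # hidden dim, LoRA rank, number of adapters

W = rng.normal(size=(d, d))                      # frozen base weight
loras = [(rng.normal(size=(d, r)) * 0.1,         # B_i: rank-r up-projection
          rng.normal(size=(r, d)) * 0.1)         # A_i: rank-r down-projection
         for _ in range(n_experts)]
W_gate = rng.normal(size=(d, n_experts)) * 0.1   # lightweight gating head

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_lora_forward(h):
    """Token-level mixing: the gate scores this token's hidden state and the
    experts' low-rank updates B_i A_i h are blended with those weights, so
    different tokens in one sequence can lean on different adapters."""
    gate = softmax(h @ W_gate)                              # (n_experts,)
    delta = sum(g * (h @ A.T @ B.T) for g, (B, A) in zip(gate, loras))
    return h @ W.T + delta, gate

h = rng.normal(size=d)             # hidden state of a single token
out, gate = moe_lora_forward(h)
print(gate, out.shape)
```

Only the small B_i, A_i matrices and the gate are trained, which is what makes this specialization cheap relative to full fine-tuning.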
https://arxiv.org/abs/2601.12338
The design-build-test cycle is essential for innovation, but physical prototyping is often slow and expensive. Although physics-based simulation and strategic prototyping can reduce cost, meaningful evaluation is frequently constrained until an integrated prototype is built. This paper investigates whether a generative pretrained transformer (GPT) can predict information typically obtained through prototyping, including cost, performance, and perceived usability. We introduce a retrieval-augmented generation (RAG) method to emulate design feedback using OpenAI GPT-4o, grounded in prototyping data scraped from this http URL to increase access to relevant precedent. Two studies are reported. First, a controlled experiment compares GPT-RAG and human designers, who receive design sketches and predict cost, performance, and usability; predictions are evaluated against ground-truth results from physical prototypes. Second, we report an applied demonstration in which a physical prototype is produced from GPT-RAG recommendations and compared with a commercial baseline and a topology-optimized design. Results show that GPT-RAG provides more accurate cost and performance estimates than individual or crowd human estimates, while yielding comparable usability insights; the GPT-RAG-informed prototype also outperforms both comparison prototypes. Repeated querying with response averaging significantly improves accuracy, suggesting that LLMs can emulate crowd aggregation effects consistent with the law of large numbers.
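The repeated-querying effect claimed at the end of the abstract can be illustrated with a toy simulation. Each query is modeled as an unbiased noisy estimate of the true cost; `noisy_llm_estimate` and all numbers are hypothetical stand-ins, not the paper's pipeline or data:

```python
import random

random.seed(7)
TRUE_COST = 120.0   # illustrative ground-truth prototyping cost

def noisy_llm_estimate():
    """Stand-in for one GPT-RAG query: unbiased but noisy cost estimate.
    Gaussian noise is a simplifying modeling assumption."""
    return random.gauss(TRUE_COST, 30.0)

def averaged_estimate(n_queries):
    """Repeated querying with response averaging: by the law of large
    numbers, the mean of n independent estimates concentrates on the truth
    (standard error shrinks roughly as 1 / sqrt(n))."""
    return sum(noisy_llm_estimate() for _ in range(n_queries)) / n_queries

single = averaged_estimate(1)
pooled = averaged_estimate(50)
print(abs(single - TRUE_COST), abs(pooled - TRUE_COST))
```

This is the crowd-aggregation analogy the abstract draws: averaging many LLM responses plays the role of polling many human estimators.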
https://arxiv.org/abs/2601.12276
Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.
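The alternating two-block optimization can be sketched on a toy differentiable score with a cross-modal coupling term. The score function, step sizes, and L-infinity budget below are illustrative assumptions, not MGEO's actual objective or the VLM's ranking function:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
w_img = rng.normal(size=d)    # toy per-modality score weights
w_txt = rng.normal(size=d)

def score(img_delta, txt_emb):
    """Toy stand-in for the VLM ranking score, with a bilinear cross-modal
    coupling term so neither modality is optimal in isolation."""
    return w_img @ img_delta + w_txt @ txt_emb + 0.5 * (img_delta @ txt_emb)

def alternating_attack(steps=100, lr=0.05, eps=0.3):
    img_delta = np.zeros(d)    # image perturbation, kept small for imperceptibility
    txt_emb = np.zeros(d)      # continuous relaxation of the textual suffix
    for _ in range(steps):
        # Block 1: gradient ascent on the image perturbation (text fixed)
        img_delta += lr * (w_img + 0.5 * txt_emb)
        img_delta = np.clip(img_delta, -eps, eps)    # L_inf "imperceptibility" budget
        # Block 2: gradient ascent on the text block (image fixed)
        txt_emb += lr * (w_txt + 0.5 * img_delta)
        txt_emb = np.clip(txt_emb, -1.0, 1.0)
    return img_delta, txt_emb

img_delta, txt_emb = alternating_attack()
print(score(img_delta, txt_emb), score(np.zeros(d), np.zeros(d)))
```

Because each block's update sees the other block's current value through the coupling term, the alternation exploits exactly the cross-modal interaction that single-modality baselines ignore.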
https://arxiv.org/abs/2601.12263