In this paper, we propose a novel hierarchical framework for robot navigation in dynamic environments with heterogeneous constraints. Our approach leverages a graph neural network trained via reinforcement learning (RL) to efficiently estimate the robot's cost-to-go, formulated as local goal recommendations. A spatio-temporal path-searching module, which accounts for kinematic constraints, is then employed to generate a reference trajectory to facilitate solving the non-convex optimization problem used for explicit constraint enforcement. More importantly, we introduce an incremental action-masking mechanism and a privileged learning strategy, enabling end-to-end training of the proposed planner. Both simulation and real-world experiments demonstrate that the proposed method effectively addresses local planning in complex dynamic environments, achieving state-of-the-art (SOTA) performance. Compared with existing learning-optimization hybrid methods, our approach eliminates the dependency on high-fidelity simulation environments, offering significant advantages in computational efficiency and training scalability. The code will be released as open-source upon acceptance of the paper.
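As a rough illustration of the action-masking idea (the schedule, action space, and function names below are assumptions, not the authors' implementation), here is a minimal sketch of masking a policy's logits over candidate local goals and relaxing the mask incrementally as training progresses:

```python
import numpy as np

def masked_action_distribution(logits, valid_mask):
    """Zero out the probability of masked (invalid) actions before sampling.

    logits     : (A,) raw policy scores over candidate local goals
    valid_mask : (A,) boolean, True where the action is currently allowed
    """
    masked_logits = np.where(valid_mask, logits, -np.inf)
    z = masked_logits - masked_logits.max()          # numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

def incremental_mask(step, base_mask, unlock_schedule):
    """Incrementally enable more actions as training advances.

    unlock_schedule maps a training-step threshold to the extra action indices
    that become valid once that threshold is reached (illustrative only).
    """
    mask = base_mask.copy()
    for threshold, indices in unlock_schedule.items():
        if step >= threshold:
            mask[indices] = True
    return mask

# toy usage: 6 candidate local goals, 3 allowed at the start
base = np.array([True, True, True, False, False, False])
schedule = {1_000: [3], 5_000: [4, 5]}
logits = np.random.randn(6)

mask = incremental_mask(step=2_000, base_mask=base, unlock_schedule=schedule)
probs = masked_action_distribution(logits, mask)
action = np.random.choice(len(probs), p=probs)
```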
https://arxiv.org/abs/2506.09859
Integrated terrestrial and non-terrestrial network (TN-NTN) architectures offer a promising solution for expanding coverage and improving network capacity. While non-terrestrial networks (NTNs) are primarily exploited for these purposes, their role in alleviating terrestrial network (TN) load and enabling energy-efficient operation has received comparatively less attention. In light of growing concerns associated with the densification of terrestrial deployments, this work explores the potential of NTNs to support a more sustainable network. We propose a novel online optimisation framework for integrated TN-NTN architectures, built on a multi-armed bandit (MAB) formulation and leveraging the Bandit-feedback Constrained Online Mirror Descent (BCOMD) algorithm. Our approach adaptively optimises key system parameters, including bandwidth allocation, user equipment (UE) association, and macro base station (MBS) shutdown, to balance network capacity and energy efficiency in real time. Extensive system-level simulations over a 24-hour period show that our framework significantly reduces the proportion of unsatisfied UEs during peak hours and achieves up to 19% throughput gains and 5% energy savings in low-traffic periods, outperforming standard network settings that follow 3GPP recommendations.
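To make the bandit-feedback mirror-descent idea concrete, here is a schematic sketch of entropy-regularized online mirror descent over a probability simplex with bandit feedback; the abstract does not spell out the BCOMD update or its constraints, so the estimator, step size, and toy losses below are illustrative assumptions rather than the paper's algorithm:

```python
import numpy as np

def bandit_mirror_descent(loss_fn, n_arms, horizon, eta=0.1, rng=None):
    """Entropy-regularized online mirror descent with bandit feedback.

    At each round only the loss of the played arm is observed; an
    importance-weighted estimate stands in for the full loss vector.
    (Schematic sketch only; not the BCOMD algorithm from the paper.)
    """
    rng = rng or np.random.default_rng(0)
    w = np.full(n_arms, 1.0 / n_arms)           # point on the simplex
    for t in range(horizon):
        arm = rng.choice(n_arms, p=w)
        loss = loss_fn(t, arm)                  # bandit feedback: one entry only
        est = np.zeros(n_arms)
        est[arm] = loss / w[arm]                # importance-weighted estimate
        w = w * np.exp(-eta * est)              # mirror (multiplicative) step
        w = w / w.sum()                         # project back onto the simplex
    return w

# toy usage: arm 2 has the lowest expected loss, so mass should concentrate on it
means = np.array([0.9, 0.7, 0.2, 0.6])
final = bandit_mirror_descent(
    loss_fn=lambda t, a: np.random.default_rng(t).binomial(1, means[a]),
    n_arms=4, horizon=5_000)
print(final)
```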
https://arxiv.org/abs/2506.09268
Graph Neural Networks (GNNs) have substantially advanced the field of recommender systems. However, despite the creation of more than a thousand knowledge graphs (KGs) under the W3C standard RDF, their rich semantic information has not yet been fully leveraged in GNN-based recommender systems. To address this gap, we propose a comprehensive integration of RDF KGs with GNNs that utilizes both the topological information from RDF object properties and the content information from RDF datatype properties. Our main focus is an in-depth evaluation of various GNNs, analyzing how different semantic feature initializations and types of graph structure heterogeneity influence their performance in recommendation tasks. Through experiments across multiple recommendation scenarios involving multi-million-node RDF graphs, we demonstrate that harnessing the semantic richness of RDF KGs significantly improves recommender systems and lays the groundwork for GNN-based recommender systems for the Linked Open Data cloud. The code and data are available on our GitHub repository: this https URL
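A minimal sketch of the integration idea, assuming toy triples and predicate names: RDF object properties become typed edges (topology) while datatype properties become node features (content) that a GNN would consume. This is not the authors' pipeline.

```python
from collections import defaultdict

# Toy RDF-style triples: (subject, predicate, object).
# Object properties link resources; datatype properties carry literals.
triples = [
    ("ex:Alice",    "ex:likes",    "ex:Movie1"),    # object property -> edge
    ("ex:Movie1",   "ex:hasGenre", "ex:SciFi"),     # object property -> edge
    ("ex:Movie1",   "ex:title",    '"Arrival"'),    # datatype property -> feature
    ("ex:Movie1",   "ex:year",     '"2016"'),       # datatype property -> feature
]

def split_rdf(triples, datatype_predicates):
    """Separate topology (typed edges) from content (node features)."""
    edges, features = [], defaultdict(dict)
    for s, p, o in triples:
        if p in datatype_predicates:
            features[s][p] = o.strip('"')   # literal becomes a node attribute
        else:
            edges.append((s, p, o))         # resource link becomes a typed edge
    return edges, dict(features)

edges, features = split_rdf(triples, datatype_predicates={"ex:title", "ex:year"})
# edges    -> [('ex:Alice', 'ex:likes', 'ex:Movie1'), ('ex:Movie1', 'ex:hasGenre', 'ex:SciFi')]
# features -> {'ex:Movie1': {'ex:title': 'Arrival', 'ex:year': '2016'}}
# A GNN would embed `features` as initial node vectors and use the typed
# `edges` as the message-passing structure of a heterogeneous graph.
```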
https://arxiv.org/abs/2506.08743
In this article, we present a novel multimodal feedback framework called MOSAIC-F, an acronym for a data-driven Framework that integrates Multimodal Learning Analytics (MMLA), Observations, Sensors, Artificial Intelligence (AI), and Collaborative assessments for generating personalized feedback on student learning activities. This framework consists of four key steps. First, peers and professors' assessments are conducted through standardized rubrics (that include both quantitative and qualitative evaluations). Second, multimodal data are collected during learning activities, including video recordings, audio capture, gaze tracking, physiological signals (heart rate, motion data), and behavioral interactions. Third, personalized feedback is generated using AI, synthesizing human-based evaluations and data-based multimodal insights such as posture, speech patterns, stress levels, and cognitive load, among others. Finally, students review their own performance through video recordings and engage in self-assessment and feedback visualization, comparing their own evaluations with peers and professors' assessments, class averages, and AI-generated recommendations. By combining human-based and data-based evaluation techniques, this framework enables more accurate, personalized and actionable feedback. We tested MOSAIC-F in the context of improving oral presentation skills.
https://arxiv.org/abs/2506.08634
Optimizing the presentation of search and recommendation results is crucial to enhancing user experience and engagement. Whole Page Optimization (WPO) plays a pivotal role in this process, as it directly influences how information is surfaced to users. While Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent and contextually relevant content, fine-tuning these models for complex tasks like WPO presents challenges. Specifically, the need for extensive human-annotated data to mitigate issues such as hallucinations and model instability can be prohibitively expensive, especially in large-scale systems that interact with millions of items daily. In this work, we address the challenge of fine-tuning LLMs for WPO by using user feedback as the supervision. Unlike manually labeled datasets, user feedback is inherently noisy and less precise. To overcome this, we propose a reward-based fine-tuning approach, PageLLM, which employs a mixed-grained reward mechanism that combines page-level and item-level rewards. The page-level reward evaluates the overall quality and coherence, while the item-level reward focuses on the accuracy and relevance of key recommendations. This dual-reward structure ensures that both the holistic presentation and the critical individual components are optimized. We validate PageLLM on both public and industrial datasets. PageLLM outperforms baselines and achieves a 0.44% GMV increase in an online A/B test with over 10 million users, demonstrating its real-world impact.
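A minimal sketch of a mixed-grained reward in the spirit described above; the weighting scheme, score ranges, and the way key items are identified are assumptions, not PageLLM's actual reward models:

```python
def mixed_grained_reward(page_score, item_scores, key_items, alpha=0.5):
    """Combine a page-level reward with item-level rewards.

    page_score  : scalar in [0, 1] for overall page quality/coherence
    item_scores : dict item_id -> [0, 1] accuracy/relevance score
    key_items   : ids of the critical recommendations on the page
    alpha       : trade-off between granularities (assumed, not from the paper)
    """
    if key_items:
        item_reward = sum(item_scores.get(i, 0.0) for i in key_items) / len(key_items)
    else:
        item_reward = 0.0
    return alpha * page_score + (1.0 - alpha) * item_reward

# toy usage
reward = mixed_grained_reward(
    page_score=0.8,
    item_scores={"sku1": 0.9, "sku2": 0.4, "sku3": 0.7},
    key_items=["sku1", "sku3"])
print(round(reward, 3))   # 0.5*0.8 + 0.5*0.8 = 0.8
```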
https://arxiv.org/abs/2506.09084
The rise of large language models (LLMs) has created new possibilities for digital twins in healthcare. However, the deployment of such systems in consumer health contexts raises significant concerns related to hallucination, bias, lack of transparency, and ethical misuse. In response to recommendations from health authorities such as the World Health Organization (WHO), we propose Responsible Health Twin (RHealthTwin), a principled framework for building and governing AI-powered digital twins for well-being assistance. RHealthTwin processes multimodal inputs that guide a health-focused LLM to produce safe, relevant, and explainable responses. At the core of RHealthTwin is the Responsible Prompt Engine (RPE), which addresses the limitations of traditional LLM configuration. Conventionally, users provide an unstructured prompt and a system instruction to configure the LLM, which increases the risk of hallucination. In contrast, RPE dynamically extracts predefined slots to structure both inputs. This guides the language model to generate responses that are context-aware, personalized, fair, reliable, and explainable for well-being assistance. The framework further adapts over time through a feedback loop that updates the prompt structure based on user satisfaction. We evaluate RHealthTwin across four consumer health domains: mental support, symptom triage, nutrition planning, and activity coaching. RPE achieves state-of-the-art results with BLEU = 0.41, ROUGE-L = 0.63, and BERTScore = 0.89 on benchmark datasets. We also achieve over 90% on ethical compliance and instruction-following metrics using LLM-as-judge evaluation, outperforming baseline strategies. We envision RHealthTwin as a forward-looking foundation for responsible LLM-based applications in health and well-being.
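A toy sketch of the slot-extraction idea behind RPE; the slot names, regex patterns, and prompt template are invented for illustration and are not the framework's actual configuration:

```python
import re

SLOTS = {
    "symptom":  re.compile(r"(headache|fatigue|nausea|insomnia)", re.I),
    "duration": re.compile(r"(\d+\s*(?:days?|weeks?|hours?))", re.I),
    "goal":     re.compile(r"(sleep better|eat healthier|reduce stress)", re.I),
}

def extract_slots(user_text):
    """Fill predefined slots from free-form user input (illustrative patterns only)."""
    return {name: (m.group(1) if (m := rx.search(user_text)) else None)
            for name, rx in SLOTS.items()}

def build_structured_prompt(slots):
    """Turn the slot dictionary into a structured instruction for the health LLM."""
    lines = ["You are a well-being assistant. Respond safely and state uncertainty."]
    lines += [f"- {name}: {value}" for name, value in slots.items() if value]
    lines.append("Provide a personalized, explainable suggestion. "
                 "Do not diagnose; recommend professional care when appropriate.")
    return "\n".join(lines)

slots = extract_slots("I've had a headache for 3 days and want to reduce stress.")
print(build_structured_prompt(slots))
```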
https://arxiv.org/abs/2506.08486
Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.
https://arxiv.org/abs/2506.08480
Graph recommendation systems have been widely studied due to their ability to effectively capture the complex interactions between users and items. However, these systems also exhibit certain vulnerabilities when faced with attacks. The prevailing shilling attack methods typically manipulate recommendation results by injecting a large number of fake nodes and edges. However, such attack strategies face two primary challenges: low stealth and high destructiveness. To address these challenges, this paper proposes a novel graph backdoor attack method that aims to enhance the exposure of target items to the target user in a covert manner, without affecting other unrelated nodes. Specifically, we design a single-node trigger generator, which can effectively expose multiple target items to the target user by inserting only one fake user node. Additionally, we introduce constraint conditions between the target nodes and irrelevant nodes to mitigate the impact of fake nodes on the recommendation system's performance. Experimental results show that the exposure of the target items reaches no less than 50% in 99% of the target users, while the impact on the recommendation system's performance is controlled within approximately 5%.
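As a small illustration of how the headline metric could be computed (the 50% exposure threshold mirrors the abstract, but the data layout and helper names are assumptions), here is a sketch of measuring the exposure rate of target items across target users:

```python
def exposure_rate(topk_lists, target_items, min_fraction=0.5):
    """Fraction of target users whose top-k list exposes enough target items.

    topk_lists   : dict user_id -> ranked list of recommended item ids
    target_items : set of item ids the attacker wants to promote
    min_fraction : required share of target items in the top-k (assumed 50%)
    """
    hits = 0
    for user, topk in topk_lists.items():
        exposed = len(target_items & set(topk)) / len(target_items)
        hits += exposed >= min_fraction
    return hits / len(topk_lists)

# toy usage: 3 target users, 2 target items
lists = {
    "u1": ["i9", "t1", "i3", "t2"],   # both targets exposed
    "u2": ["i1", "t1", "i2", "i5"],   # one of two targets exposed (50%)
    "u3": ["i7", "i8", "i2", "i4"],   # no targets exposed
}
print(exposure_rate(lists, target_items={"t1", "t2"}))   # 2/3
```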
https://arxiv.org/abs/2506.08401
We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for measuring compositional understanding capabilities of vision-language models (VLMs). We scrutinize design choices in their construction, including data source (e.g. MS-COCO) and curation procedures (e.g. constructing negative images/captions), uncovering several inherent biases across most benchmarks. We find that blind heuristics (e.g. token-length, log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. We demonstrate that the underlying factor is a distribution asymmetry between positive and negative images/captions, induced by the benchmark construction procedures. To mitigate these issues, we provide a few key recommendations for constructing more robust vision-language compositional understanding benchmarks, that would be less prone to such simple attacks.
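A minimal sketch of a "blind" baseline of the kind the abstract describes: it picks a caption using text-only signals (here, token length; a language-model log-likelihood could be plugged in the same way) without ever looking at the image. The example pairs and scoring are illustrative, not drawn from any of the 17 benchmarks:

```python
def blind_caption_choice(pos_caption, neg_caption, score_fn):
    """Pick a caption without ever looking at the image.

    score_fn assigns a text-only score (e.g., negative token count, or a
    language-model log-likelihood supplied by the caller).
    Returns True if the heuristic picks the positive caption.
    """
    return score_fn(pos_caption) >= score_fn(neg_caption)

# Heuristic: prefer the shorter caption (token-length bias).
token_length_score = lambda text: -len(text.split())

pairs = [
    ("a dog chasing a ball", "a dog chasing a ball in a very large and open grassy park"),
    ("two cats on a sofa", "two cats"),
]
accuracy = sum(blind_caption_choice(p, n, token_length_score) for p, n in pairs) / len(pairs)
# If a text-only heuristic separates positives from negatives well, the
# benchmark's negatives are distributionally asymmetric, i.e. biased.
print(accuracy)
```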
https://arxiv.org/abs/2506.08227
We introduce a trend-aware and visually-grounded fashion recommendation system that integrates deep visual representations, garment-aware segmentation, semantic category similarity and user behavior simulation. Our pipeline extracts focused visual embeddings by masking non-garment regions via semantic segmentation followed by feature extraction using pretrained CNN backbones (ResNet-50, DenseNet-121, VGG16). To simulate realistic shopping behavior, we generate synthetic purchase histories influenced by user-specific trendiness and item popularity. Recommendations are computed using a weighted scoring function that fuses visual similarity, semantic coherence and popularity alignment. Experiments on the DeepFashion dataset demonstrate consistent gender alignment and improved category relevance, with ResNet-50 achieving 64.95% category similarity and lowest popularity MAE. An ablation study confirms the complementary roles of visual and popularity cues. Our method provides a scalable framework for personalized fashion recommendations that balances individual style with emerging trends. Our implementation is available at this https URL
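A minimal sketch of a weighted scoring function in the spirit of the pipeline above; the weights, field names, and score normalizations are assumptions, not the values used in the paper:

```python
import numpy as np

def recommend(query_vec, items, w_visual=0.5, w_semantic=0.3, w_popularity=0.2, k=5):
    """Score candidate garments by fusing visual, semantic and popularity signals.

    items : list of dicts with keys
        'id'         : item identifier
        'emb'        : CNN embedding of the masked garment region (unit-normalized)
        'cat_sim'    : semantic category similarity to the query, in [0, 1]
        'popularity' : normalized popularity score, in [0, 1]
    """
    scores = []
    for item in items:
        visual = float(np.dot(query_vec, item["emb"]))   # cosine on unit vectors
        scores.append(w_visual * visual
                      + w_semantic * item["cat_sim"]
                      + w_popularity * item["popularity"])
    order = np.argsort(scores)[::-1][:k]
    return [items[i]["id"] for i in order]
```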
https://arxiv.org/abs/2506.07773
Personalized recommendation systems must adapt to user interactions across different domains. Traditional approaches like MLoRA apply a single adaptation per domain but lack flexibility in handling diverse user behaviors. To address this, we propose MoE-MLoRA, a mixture-of-experts framework where each expert is first trained independently to specialize in its domain before a gating network is trained to weight their contributions dynamically. We evaluate MoE-MLoRA across eight CTR models on Movielens and Taobao, showing that it improves performance on large-scale, dynamic datasets (+1.45 Weighted-AUC on Taobao-20) but offers limited benefits in structured datasets with low domain diversity and sparsity. Further analysis of the number of experts per domain reveals that larger ensembles do not always improve performance, indicating the need for model-aware tuning. Our findings highlight the potential of expert-based architectures for multi-domain recommendation systems, demonstrating that task-aware specialization and adaptive gating can enhance predictive accuracy in complex environments. The implementation and code are available in our GitHub repository.
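A toy sketch of the mixture-of-experts idea, assuming pre-trained per-domain experts and a softmax gating network trained afterwards; the dimensions, expert forms, and sigmoid head are illustrative stand-ins for the actual MoE-MLoRA modules:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def moe_predict(features, experts, gate_weights, gate_bias):
    """Mixture-of-experts CTR prediction.

    experts      : list of per-domain scoring functions, each features -> logit
    gate_weights : (n_experts, d) matrix of the gating network, trained after the
                   experts as in the setup above (values here are toy)
    """
    expert_logits = np.array([f(features) for f in experts])
    gate = softmax(gate_weights @ features + gate_bias)   # dynamic expert weighting
    return 1.0 / (1.0 + np.exp(-(gate @ expert_logits)))  # sigmoid of weighted logit

# toy usage: three domain experts, 4-dim feature vector
rng = np.random.default_rng(0)
experts = [lambda x, w=rng.standard_normal(4): float(w @ x) for _ in range(3)]
gate_W, gate_b = rng.standard_normal((3, 4)), rng.standard_normal(3)
ctr = moe_predict(rng.standard_normal(4), experts, gate_W, gate_b)
print(f"predicted CTR: {ctr:.3f}")
```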
https://arxiv.org/abs/2506.07563
Recent advances in Large Language Models (LLMs) have driven their adoption in recommender systems through Retrieval-Augmented Generation (RAG) frameworks. However, existing RAG approaches predominantly rely on flat, similarity-based retrieval that fails to leverage the rich relational structure inherent in user-item interactions. We introduce LlamaRec-LKG-RAG, a novel single-pass, end-to-end trainable framework that integrates personalized knowledge graph context into LLM-based recommendation ranking. Our approach extends the LlamaRec architecture by incorporating a lightweight user preference module that dynamically identifies salient relation paths within a heterogeneous knowledge graph constructed from user behavior and item metadata. These personalized subgraphs are seamlessly integrated into prompts for a fine-tuned Llama-2 model, enabling efficient and interpretable recommendations through a unified inference step. Comprehensive experiments on ML-100K and Amazon Beauty datasets demonstrate consistent and significant improvements over LlamaRec across key ranking metrics (MRR, NDCG, Recall). LlamaRec-LKG-RAG demonstrates the critical value of structured reasoning in LLM-based recommendations and establishes a foundation for scalable, knowledge-aware personalization in next-generation recommender systems. Code is available at this https URL.
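A small sketch of the knowledge-graph-to-prompt idea: enumerate short relation paths around the user and serialize them into the ranking prompt. The toy graph, the path filter standing in for the learned preference module, and the prompt template are all assumptions:

```python
# Toy heterogeneous knowledge graph: node -> list of (relation, neighbor).
KG = {
    "user_7":     [("watched", "Heat"), ("watched", "Collateral")],
    "Heat":       [("directed_by", "Michael Mann"), ("genre", "crime")],
    "Collateral": [("directed_by", "Michael Mann"), ("genre", "thriller")],
    "Blackhat":   [("directed_by", "Michael Mann"), ("genre", "thriller")],
}

def relation_paths(kg, user, max_hops=2):
    """Enumerate short relation paths starting at the user (illustrative selection;
    the paper's learned preference module would instead score and keep salient ones)."""
    paths, frontier = [], [(user, [])]
    for _ in range(max_hops):
        nxt = []
        for node, path in frontier:
            for rel, nb in kg.get(node, []):
                new_path = path + [(node, rel, nb)]
                paths.append(new_path)
                nxt.append((nb, new_path))
        frontier = nxt
    return paths

def to_prompt(user, candidate, paths):
    """Serialize paths into the context section of a ranking prompt (template assumed)."""
    lines = [f"User: {user}", f"Candidate item: {candidate}", "Knowledge-graph context:"]
    lines += ["  " + " -> ".join(f"{s} -{r}-> {o}" for s, r, o in p) for p in paths]
    lines.append("Question: should the candidate be ranked highly for this user?")
    return "\n".join(lines)

# keep only paths that end at a shared entity with the candidate item
paths = [p for p in relation_paths(KG, "user_7") if p[-1][2] == "Michael Mann"]
print(to_prompt("user_7", "Blackhat", paths))
```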
https://arxiv.org/abs/2506.07449
Effective evaluation is critical for driving advances in multimodal large language model (MLLM) research. The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities. Unlike mathematical reasoning, surgical decision-making operates in life-critical domains and requires meticulous, verifiable processes to ensure reliability and patient safety. This task demands the ability to distinguish between atomic visual actions and coordinate complex, long-horizon procedures, capabilities that are inadequately evaluated by current benchmarks. To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable MLLMs to perform interpretable surgical action planning. Our SAP-Bench benchmark is derived from cholecystectomy procedures (mean duration: 1137.5 s) and introduces temporally grounded surgical action annotations, comprising 1,226 clinically validated action clips (mean duration: 68.7 s) that capture five fundamental surgical actions across 74 procedures. The dataset provides 1,152 strategically sampled current frames, each paired with the corresponding next action as multimodal analysis anchors. We propose the MLLM-SAP framework, which leverages MLLMs to generate next-action recommendations from the current surgical scene and natural language instructions, enhanced with injected surgical domain knowledge. To assess our dataset's effectiveness and the broader capabilities of current models, we evaluate seven state-of-the-art MLLMs (OpenAI-o1, GPT-4o, QwenVL2.5-72B, Claude-3.5-Sonnet, GeminiPro2.5, Step-1o, and GLM-4v) and reveal critical gaps in next-action prediction performance.
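One plausible layout for a single SAP-Bench analysis anchor and the query an MLLM would consume; the field names and file paths are assumptions, not the released schema:

```python
# Plausible layout of one multimodal analysis anchor (field names are assumed).
sample = {
    "procedure_id": "chole_042",
    "clip": {"start_s": 310.2, "end_s": 378.9},          # ~68.7 s mean clip length
    "current_frame": "frames/chole_042/000311.png",
    "instruction": "Given the current laparoscopic view, what is the next surgical action?",
    "next_action": "clip the cystic artery",              # one of five fundamental actions
}

def to_mllm_query(sample):
    """Assemble the (image, text) query an MLLM-SAP-style model would consume."""
    return {
        "image_path": sample["current_frame"],
        "prompt": sample["instruction"] + "\nAnswer with a single action phrase.",
        "reference": sample["next_action"],                # used only for evaluation
    }

print(to_mllm_query(sample)["prompt"])
```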
https://arxiv.org/abs/2506.07196
As large language models (LLMs) have become more human-like and human-AI communication has become prevalent, prompting has emerged as a decisive component. However, there is limited conceptual consensus on what exactly constitutes a natural language prompt. We attempt to address this question by conducting a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences (2022-2025), together with blog posts. We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. We then examine how existing studies assess the impact of these properties on LLMs, revealing imbalanced support across models and tasks as well as substantial research gaps. Further, we analyze correlations among properties in high-quality natural language prompts and derive prompting recommendations. We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact. Finally, we find that instruction-tuning on property-enhanced prompts can result in better reasoning models. Our findings establish a foundation for property-centric prompt evaluation and optimization, bridging gaps in human-AI communication and opening new directions for prompting research.
https://arxiv.org/abs/2506.06950
Navigating healthcare systems can be complex and overwhelming, creating barriers for patients seeking timely and appropriate medical attention. In this paper, we introduce C-PATH (Conversational Patient Assistance and Triage in Healthcare), a novel conversational AI system powered by large language models (LLMs) designed to assist patients in recognizing symptoms and recommending appropriate medical departments through natural, multi-turn dialogues. C-PATH is fine-tuned on medical knowledge, dialogue data, and clinical summaries using a multi-stage pipeline built on the LLaMA3 architecture. A core contribution of this work is a GPT-based data augmentation framework that transforms structured clinical knowledge from DDXPlus into lay-person-friendly conversations, allowing alignment with patient communication norms. We also implement a scalable conversation history management strategy to ensure long-range coherence. Evaluation with GPTScore demonstrates strong performance across dimensions such as clarity, informativeness, and recommendation accuracy. Quantitative benchmarks show that C-PATH achieves superior performance in GPT-rewritten conversational datasets, significantly outperforming domain-specific baselines. C-PATH represents a step forward in the development of user-centric, accessible, and accurate AI tools for digital health assistance and triage.
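A simple sketch of one way to manage long conversation histories (keep recent turns verbatim, compress older ones); the paper's actual strategy is not detailed in the abstract, so the window size and naive summarizer below are assumed stand-ins:

```python
def manage_history(turns, max_recent=6, summarize=None):
    """Keep the most recent turns verbatim and compress the rest into a summary.

    turns     : list of (speaker, text) tuples for the whole dialogue
    summarize : callable that condenses older turns (e.g., an LLM call); a naive
                truncation stands in here so the sketch stays self-contained
    """
    if len(turns) <= max_recent:
        return turns
    older, recent = turns[:-max_recent], turns[-max_recent:]
    summarize = summarize or (lambda ts: "; ".join(t for _, t in ts)[:200])
    return [("system", f"Summary of earlier conversation: {summarize(older)}")] + recent

# toy usage
dialogue = [("patient", f"message {i}") for i in range(12)]
for speaker, text in manage_history(dialogue, max_recent=4):
    print(speaker, ":", text[:60])
```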
https://arxiv.org/abs/2506.06737
Scientific recommender systems, such as Google Scholar and Web of Science, are essential tools for discovery. The search algorithms that power them work through stigmergy, a collective intelligence mechanism that surfaces useful paths through repeated engagement. While generally effective, this "rich-get-richer" dynamic results in a small number of high-profile papers dominating visibility. This essay argues that these algorithms' over-reliance on popularity fosters intellectual homogeneity and exacerbates structural inequities, stifling the innovative and diverse perspectives critical for scientific progress. We propose an overhaul of search platforms to incorporate user-specific calibration, allowing researchers to manually adjust the weights of factors such as popularity, recency, and relevance. We also advise platform developers on how word embeddings and LLMs could be implemented in ways that increase user autonomy. While our suggestions are particularly pertinent to aligning recommender systems with scientific values, these ideas are broadly applicable to information access systems in general. Designing platforms that increase user autonomy is an important step toward more robust and dynamic information environments.
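A minimal sketch of the user-specific calibration being proposed: ranking scores become an explicit, user-adjustable weighted sum over popularity, recency, and relevance. Field names and weights are illustrative:

```python
def calibrated_rank(papers, w_relevance=0.6, w_recency=0.2, w_popularity=0.2):
    """Rank papers with user-adjustable weights instead of a fixed popularity bias.

    Each paper is a dict with normalized scores in [0, 1] for
    'relevance', 'recency' and 'popularity'. Weights are set by the user;
    for example, someone hunting for overlooked work can set w_popularity = 0.
    """
    def score(p):
        return (w_relevance * p["relevance"]
                + w_recency * p["recency"]
                + w_popularity * p["popularity"])
    return sorted(papers, key=score, reverse=True)

papers = [
    {"title": "Highly cited classic", "relevance": 0.6, "recency": 0.1, "popularity": 1.0},
    {"title": "Recent niche result",  "relevance": 0.8, "recency": 0.9, "popularity": 0.1},
]
# Default weights favour the niche result; shifting weight to popularity flips the order.
print([p["title"] for p in calibrated_rank(papers)])
print([p["title"] for p in calibrated_rank(papers, w_relevance=0.2, w_recency=0.1, w_popularity=0.7)])
```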
https://arxiv.org/abs/2506.06162
Wheat management strategies play a critical role in determining yield. Traditional management decisions often rely on labour-intensive expert inspections, which are expensive, subjective and difficult to scale. Recently, Vision-Language Models (VLMs) have emerged as a promising solution to enable scalable, data-driven management support. However, due to a lack of domain-specific knowledge, directly applying VLMs to wheat management tasks results in poor quantification and reasoning capabilities, ultimately producing vague or even misleading management recommendations. In response, we propose WisWheat, a wheat-specific dataset with a three-layered design to enhance VLM performance on wheat management tasks: (1) a foundational pretraining dataset of 47,871 image-caption pairs for coarsely adapting VLMs to wheat morphology; (2) a quantitative dataset comprising 7,263 VQA-style image-question-answer triplets for quantitative trait measuring tasks; and (3) an Instruction Fine-tuning dataset with 4,888 samples targeting biotic and abiotic stress diagnosis and management plan for different phenological stages. Extensive experimental results demonstrate that fine-tuning open-source VLMs (e.g., Qwen2.5 7B) on our dataset leads to significant performance improvements. Specifically, the Qwen2.5 VL 7B fine-tuned on our wheat instruction dataset achieves accuracy scores of 79.2% and 84.6% on wheat stress and growth stage conversation tasks respectively, surpassing even general-purpose commercial models such as GPT-4o by a margin of 11.9% and 34.6%.
https://arxiv.org/abs/2506.06084
In this work, we present LengClaro2023, a dataset of legal-administrative texts in Spanish. Based on the most frequently used procedures from the Spanish Social Security website, we have created for each text two simplified equivalents. The first version follows the recommendations provided by arText claro. The second version incorporates additional recommendations from plain language guidelines to explore further potential improvements in the system. The linguistic resource created in this work can be used for evaluating automatic text simplification (ATS) systems in Spanish.
https://arxiv.org/abs/2506.05927
With the rapid growth of fintech, personalized financial product recommendations have become increasingly important. Traditional methods like collaborative filtering or content-based models often fail to capture users' latent preferences and complex relationships. We propose a hybrid framework integrating large language models (LLMs) and graph neural networks (GNNs). A pre-trained LLM encodes text data (e.g., user reviews) into rich feature vectors, while a heterogeneous user-product graph models interactions and social ties. Through a tailored message-passing mechanism, text and graph information are fused within the GNN to jointly optimize embeddings. Experiments on public and real-world financial datasets show our model outperforms standalone LLM or GNN in accuracy, recall, and NDCG, with strong interpretability. This work offers new insights for personalized financial recommendations and cross-modal fusion in broader recommendation tasks.
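A toy sketch of the fusion idea, assuming placeholder text embeddings in place of a real pretrained LLM encoder: text features initialize the nodes of a user-product graph, and a couple of rounds of mean-aggregation message passing blend them before a dot-product recommendation head:

```python
import zlib
import numpy as np

def message_passing(node_feats, edges, n_rounds=2):
    """Mean-aggregate neighbour features over a user-product interaction graph.

    node_feats : dict node_id -> feature vector (here: text embeddings)
    edges      : list of (u, v) undirected interaction / social edges
    """
    neighbours = {n: [] for n in node_feats}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    feats = {n: f.copy() for n, f in node_feats.items()}
    for _ in range(n_rounds):
        feats = {n: (0.5 * feats[n]
                     + 0.5 * np.mean([feats[m] for m in neighbours[n]], axis=0))
                    if neighbours[n] else feats[n]
                 for n in feats}
    return feats

# Stand-in for a pretrained LLM text encoder; a real system would call the model.
def fake_llm_encode(text, dim=8):
    return np.random.default_rng(zlib.crc32(text.encode())).standard_normal(dim)

nodes = {
    "user:alice":    fake_llm_encode("review: loves low-risk index funds"),
    "prod:etf_1":    fake_llm_encode("broad-market ETF with low fees"),
    "prod:crypto_1": fake_llm_encode("high-volatility crypto product"),
}
edges = [("user:alice", "prod:etf_1")]           # observed interaction
fused = message_passing(nodes, edges)
score = lambda u, p: float(fused[u] @ fused[p])  # dot-product recommendation head
# The interacted-with product typically ends up closer to the user after fusion.
print(score("user:alice", "prod:etf_1"), score("user:alice", "prod:crypto_1"))
```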
https://arxiv.org/abs/2506.05873
Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In this work, we introduce the notion of a reasoning graph, extracted by clustering hidden-state representations at each reasoning step, and systematically analyze three key graph-theoretic properties: cyclicity, diameter, and small-world index, across multiple tasks (GSM8K, MATH500, AIME 2024). Our findings reveal that distilled reasoning models (e.g., DeepSeek-R1-Distill-Qwen-32B) exhibit significantly more recurrent cycles (about 5 per sample), substantially larger graph diameters, and pronounced small-world characteristics (about 6x) compared to their base counterparts. Notably, these structural advantages grow with task difficulty and model capacity, with cycle detection peaking at the 14B scale and exploration diameter maximized in the 32B variant, correlating positively with accuracy. Furthermore, we show that supervised fine-tuning on an improved dataset systematically expands reasoning graph diameters in tandem with performance gains, offering concrete guidelines for dataset design aimed at boosting reasoning capabilities. By bridging theoretical insights into reasoning graph structures with practical recommendations for data construction, our work advances both the interpretability and the efficacy of large reasoning models.
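A small sketch of how the graph-theoretic properties could be computed once reasoning steps have been clustered (the clustering itself is assumed done); networkx supplies the graph statistics, and the toy trace is illustrative:

```python
import networkx as nx

def reasoning_graph(step_cluster_ids):
    """Build a directed graph whose nodes are hidden-state clusters and whose
    edges follow consecutive reasoning steps (clustering itself is assumed done)."""
    g = nx.DiGraph()
    for a, b in zip(step_cluster_ids, step_cluster_ids[1:]):
        g.add_edge(a, b)
    return g

def graph_stats(g):
    cycles = list(nx.simple_cycles(g))                      # revisited reasoning states
    und = g.to_undirected()
    diameter = nx.diameter(und) if nx.is_connected(und) else None
    return {"n_cycles": len(cycles), "diameter": diameter,
            "avg_clustering": nx.average_clustering(und)}   # ingredient of small-worldness

# toy trace: the model revisits cluster 2 (a cycle) before reaching the answer cluster 6
trace = [0, 1, 2, 3, 2, 4, 5, 6]
print(graph_stats(reasoning_graph(trace)))
```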
https://arxiv.org/abs/2506.05744