Grammar plays a critical role in natural language processing and text/code generation by defining syntax, enabling parser construction, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs to perform few-shot grammar generation, where grammars are inferred from small sets of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduce a novel dataset comprising 540 structured grammar-generation challenges, devise 6 metrics, and evaluate 8 different LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.
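A minimal sketch of the fitness-and-selection loop behind an LLM-driven genetic search of this kind, assuming hypothetical `propose`/`mutate` callables that stand in for LLM prompts and an `accepts(grammar, string)` oracle; for a self-contained demo the "grammars" are stand-in regular expressions rather than BNF, so this illustrates the loop, not HyGenar's actual operators.

```python
import random, re

def fitness(grammar, positives, negatives, accepts):
    """Share of positive examples accepted plus negative examples rejected."""
    pos_ok = sum(accepts(grammar, s) for s in positives)
    neg_ok = sum(not accepts(grammar, s) for s in negatives)
    return (pos_ok + neg_ok) / (len(positives) + len(negatives))

def evolve(positives, negatives, accepts, propose, mutate,
           pop_size=4, generations=10, seed=0):
    """Generic LLM-driven genetic loop: propose candidates, score, select, mutate."""
    rng = random.Random(seed)
    population = propose(pop_size)
    for _ in range(generations):
        population.sort(key=lambda g: fitness(g, positives, negatives, accepts),
                        reverse=True)
        parents = population[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng) for _ in parents]
        population = parents + children
    return max(population, key=lambda g: fitness(g, positives, negatives, accepts))

# Toy demo: the "grammars" are stand-in regexes; in HyGenar an LLM would propose
# and mutate BNF grammars instead, and `accepts` would be a parser-based check.
positives, negatives = ["ab", "aab", "aaab"], ["b", "ba", "abb"]
accepts = lambda g, s: re.fullmatch(g, s) is not None
propose = lambda n: ["a*b", "ab*", "[ab]+", "ab"][:n]
mutate  = lambda g, rng: rng.choice([g, g.replace("*", "+"), "a" + g])
print(evolve(positives, negatives, accepts, propose, mutate))
```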
https://arxiv.org/abs/2505.16978
Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
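A minimal sketch of the single-pass interface this implies, with a toy `joint_model` standing in for the fine-tuned extractor-verifier; the aggregation shown (fraction of supported claims) is a simplified placeholder for the paper's actual metric.

```python
from typing import Callable, List, Tuple

# One model call returns (claim, verdict) pairs for the whole response;
# `joint_model` is a stand-in for the fine-tuned joint extractor-verifier.
Claim = Tuple[str, str]  # (claim text, "supported" | "unsupported")

def verifast_score(response: str, evidence: str,
                   joint_model: Callable[[str, str], List[Claim]]) -> float:
    """Fraction of extracted verifiable claims judged supported by the evidence."""
    claims = joint_model(response, evidence)
    if not claims:
        return 0.0
    supported = sum(1 for _, verdict in claims if verdict == "supported")
    return supported / len(claims)

# Toy stub illustrating the expected I/O contract of the single-pass model.
def toy_joint_model(response: str, evidence: str) -> List[Claim]:
    return [(sent.strip(), "supported" if sent.strip() in evidence else "unsupported")
            for sent in response.split(".") if sent.strip()]

print(verifast_score("Paris is in France. The moon is made of cheese.",
                     "Paris is in France", toy_joint_model))  # 0.5
```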
https://arxiv.org/abs/2505.16973
In this paper, we present the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025 (confirmed on May 20, 2025). Inspired by the success of large models, we evaluate and leverage leading accessible multimodal large models and adapt them to video understanding tasks via few-shot learning and model ensemble strategies. Specifically, diversified prompt styles and process paradigms are systematically explored and evaluated to effectively guide the attention of large models, fully unleashing their powerful generalization and adaptability. Experimental results demonstrate that, with our carefully designed approach, directly utilizing an individual multimodal model already outperforms the previous state-of-the-art (SOTA) method, which involves several additional processing steps. Furthermore, an additional stage is introduced to facilitate the cooperation and ensembling of periodic results, achieving impressive performance improvements. We hope this work serves as a valuable reference for the practical application of large models and inspires future research in the field.
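As a rough illustration of answer-level ensembling only (the paper does not spell out its cooperation/ensemble stage, so the weighting rule here is an assumption), a confidence-weighted vote over per-model answers might look like this:

```python
from collections import Counter

def ensemble_answers(candidate_answers, confidences=None):
    """Confidence-weighted vote over per-model answers
    (plain majority vote when confidences are omitted)."""
    if confidences is None:
        confidences = [1.0] * len(candidate_answers)
    votes = Counter()
    for ans, conf in zip(candidate_answers, confidences):
        votes[ans] += conf
    return max(votes, key=votes.get)

# Answers from three prompt styles / models for one multiple-choice question.
print(ensemble_answers(["C", "B", "C"]))                     # -> "C"
print(ensemble_answers(["A", "B"], confidences=[0.9, 0.4]))  # -> "A"
```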
https://arxiv.org/abs/2505.16784
Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This falls into the realm of single-domain generalization, which remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts object prototypes from exemplars and then matches them with image features to construct a correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single-domain generalization few-shot counting model, Universal Representation Matching (URM). Our primary contribution is the discovery that incorporating universal vision-language representations distilled from a large-scale pretrained vision-language model into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance in both the in-domain and the newly introduced domain generalization settings.
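A minimal numpy sketch of correlation-map construction with a prototype that mixes a visual exemplar feature and a language embedding; the fusion rule (the `alpha` blend) and the use of a CLIP-style text embedding are assumptions for illustration, not URM's exact design.

```python
import numpy as np

def correlation_map(feat_map, proto_vis, proto_lang, alpha=0.5):
    """Cosine-similarity correlation map between image features and a prototype
    that mixes a visual exemplar prototype with a universal language embedding."""
    proto = alpha * proto_vis + (1 - alpha) * proto_lang        # fused prototype
    proto = proto / (np.linalg.norm(proto) + 1e-8)
    f = feat_map / (np.linalg.norm(feat_map, axis=-1, keepdims=True) + 1e-8)
    return f @ proto                                            # (H, W) similarities

rng = np.random.default_rng(0)
H, W, D = 16, 16, 64
feat_map   = rng.standard_normal((H, W, D))   # backbone feature map
proto_vis  = rng.standard_normal(D)           # pooled exemplar feature
proto_lang = rng.standard_normal(D)           # e.g. a text embedding of the class
corr = correlation_map(feat_map, proto_vis, proto_lang)
print(corr.shape)  # (16, 16)
```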
https://arxiv.org/abs/2505.16778
Medical anomaly detection (AD) is crucial for early clinical intervention, yet it faces challenges due to limited access to high-quality medical imaging data, caused by privacy concerns and data silos. Few-shot learning has emerged as a promising approach to alleviate these limitations by leveraging the large-scale prior knowledge embedded in vision-language models (VLMs). Recent advances in few-shot medical AD have treated normal and abnormal cases as a one-class classification problem, often overlooking the distinctions among multiple anomaly categories. In this paper, we therefore propose a framework tailored for few-shot medical anomaly detection in scenarios where multiple anomaly categories must be identified. To capture the detailed radiological signs of medical anomaly categories, our framework incorporates diverse textual descriptions for each category generated by a large language model, under the assumption that different anomalies in medical images may share common radiological signs within each category. Specifically, we introduce SD-MAD, a two-stage Sign-Driven few-shot Multi-Anomaly Detection framework: (i) radiological signs are aligned with anomaly categories by amplifying inter-anomaly discrepancy; (ii) aligned signs are then further selected using an automatic sign selection strategy at inference, mitigating the under-fitting and uncertain-sample issues caused by limited medical data. Moreover, we propose three protocols to comprehensively quantify the performance of multi-anomaly detection. Extensive experiments illustrate the effectiveness of our method.
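A minimal numpy sketch of sign-based scoring at inference, assuming per-category sign text embeddings are already available; the selection rule shown (keep the top-`keep` signs most similar to the image per category) is an illustrative simplification of the automatic sign selection strategy.

```python
import numpy as np

def select_signs_and_classify(img_emb, sign_embs, keep=2):
    """sign_embs: {category: (num_signs, D) text embeddings of radiological signs}.
    For each category, keep the `keep` signs most similar to the image embedding
    and average them into a category score; predict the highest-scoring category."""
    img = img_emb / np.linalg.norm(img_emb)
    scores = {}
    for cat, embs in sign_embs.items():
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        sims = embs @ img                          # similarity of each sign
        top = np.sort(sims)[-keep:]                # automatic sign selection
        scores[cat] = float(top.mean())
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(1)
D = 32
sign_embs = {"pneumonia": rng.standard_normal((4, D)),   # toy sign embeddings
             "effusion":  rng.standard_normal((3, D))}
pred, scores = select_signs_and_classify(rng.standard_normal(D), sign_embs)
print(pred, scores)
```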
https://arxiv.org/abs/2505.16659
In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context windows of LLMs have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, this approach is often hindered by the high cost of obtaining large amounts of labeled data. To address this challenge, we propose Many-Shot Adaptive Pseudo-LabEling, namely MAPLE, a novel influence-based many-shot ICL framework that utilizes pseudo-labeled samples to compensate for the lack of label information. We first identify a subset of impactful unlabeled samples and perform pseudo-labeling on them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input to improve the performance of many-shot ICL, without significant labeling costs. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data.
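A minimal sketch of the two steps as described, with an LLM labeling stub and a plain embedding-similarity heuristic in place of the paper's influence-based scoring (an assumption made for brevity):

```python
import numpy as np

def pseudo_label(pool_texts, llm_label):
    """Query the LLM once per selected unlabeled sample to get a pseudo-label."""
    return [(t, llm_label(t)) for t in pool_texts]

def demos_for_query(query_emb, pool_embs, pseudo_labeled, k=4):
    """Adaptively pick the k pseudo-labeled samples closest to this test query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    order = np.argsort(-(p @ q))[:k]
    return [pseudo_labeled[i] for i in order]

rng = np.random.default_rng(0)
pool_texts = [f"review {i}" for i in range(20)]            # impactful unlabeled pool
pool_embs = rng.standard_normal((20, 64))
llm_label = lambda t: "positive" if int(t.split()[-1]) % 2 else "negative"  # stub LLM
pseudo_labeled = pseudo_label(pool_texts, llm_label)
demos = demos_for_query(rng.standard_normal(64), pool_embs, pseudo_labeled, k=3)
print(demos)   # demonstrations prepended to this query's many-shot prompt
```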
https://arxiv.org/abs/2505.16225
This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener's unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
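A minimal numpy sketch of first-order MAML on toy linear-regression "listeners", with a separate learning rate per inner step as a simplified stand-in for per-layer per-step learning rates; it illustrates the adaptation loop only, not the SER model, Combined-Set Meta-Training, or Derivative Annealing.

```python
import numpy as np

def fomaml_linear(tasks, inner_lrs=(0.1, 0.05), outer_lr=0.01, meta_steps=200):
    """First-order MAML on 1-D linear regression, one inner learning rate per step."""
    w = np.zeros(2)                                   # [slope, bias]
    for _ in range(meta_steps):
        meta_grad = np.zeros(2)
        for x_tr, y_tr, x_va, y_va in tasks:
            w_i = w.copy()
            for lr in inner_lrs:                      # per-step inner adaptation
                pred = w_i[0] * x_tr + w_i[1]
                g = np.array([np.mean(2 * (pred - y_tr) * x_tr),
                              np.mean(2 * (pred - y_tr))])
                w_i = w_i - lr * g
            pred = w_i[0] * x_va + w_i[1]             # first-order meta-gradient
            meta_grad += np.array([np.mean(2 * (pred - y_va) * x_va),
                                   np.mean(2 * (pred - y_va))])
        w = w - outer_lr * meta_grad / len(tasks)
    return w

# Each "listener" is a task: shared inputs, listener-specific slope/bias.
rng = np.random.default_rng(1)
def make_task(slope, bias):
    x = rng.uniform(-1, 1, 20)
    y = slope * x + bias + 0.05 * rng.standard_normal(20)
    return x[:10], y[:10], x[10:], y[10:]
tasks = [make_task(s, b) for s, b in [(1.0, 0.2), (1.5, -0.1), (0.8, 0.0)]]
print(fomaml_linear(tasks))   # initialization that adapts quickly to each listener
```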
https://arxiv.org/abs/2505.16220
Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce \textbf{VLM-R$^3$} (\textbf{V}isual \textbf{L}anguage \textbf{M}odel with \textbf{R}egion \textbf{R}ecognition and \textbf{R}easoning), a framework that equips an MLLM with the ability to (i) decide \emph{when} additional visual evidence is needed, (ii) determine \emph{where} to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is \textbf{Region-Conditioned Reinforcement Policy Optimization (R-GRPO)}, a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g.\ crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.
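Since R-GRPO builds on group-relative policy optimization, the sketch below shows the group-relative advantage over a group of sampled rollouts; the composite reward terms and their weights are assumptions for illustration, not the paper's reward design.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within a group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def rollout_reward(answer_correct, region_informative, format_ok,
                   w_ans=1.0, w_region=0.3, w_fmt=0.1):
    """Toy composite reward: final answer + informative region selection + format."""
    return (w_ans * answer_correct + w_region * region_informative
            + w_fmt * format_ok)

# Four region-interleaved rollouts sampled for the same question.
rewards = [rollout_reward(1, 0.8, 1), rollout_reward(0, 0.2, 1),
           rollout_reward(1, 0.1, 0), rollout_reward(0, 0.0, 1)]
print(group_relative_advantages(rewards))
```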
https://arxiv.org/abs/2505.16192
We propose LAGO - Language Similarity-Aware Graph Optimization - a novel approach for few-shot cross-lingual embedding inversion attacks, addressing critical privacy vulnerabilities in multilingual NLP systems. Unlike prior work on embedding inversion attacks that treats languages independently, LAGO explicitly models linguistic relationships through a graph-based constrained distributed optimization framework. By integrating syntactic and lexical similarity as edge constraints, our method enables collaborative parameter learning across related languages. Theoretically, we show this formulation generalizes prior approaches, such as ALGEN, which emerges as a special case when the similarity constraints are relaxed. Our framework uniquely combines Frobenius-norm regularization with linear inequality or total variation constraints, ensuring robust alignment of cross-lingual embedding spaces even with extremely limited data (as few as 10 samples per language). Extensive experiments across multiple languages and embedding models demonstrate that LAGO substantially improves the transferability of attacks, with a 10-20% increase in Rouge-L score over baselines. This work establishes language similarity as a critical factor in inversion attack transferability, urging renewed focus on language-aware, privacy-preserving multilingual embeddings.
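A minimal numpy sketch of collaborative per-language learning with Frobenius regularization and a similarity-weighted graph penalty; the quadratic graph penalty stands in for the paper's linear-inequality / total-variation constraints (an assumption), and `lago_fit` is a hypothetical name.

```python
import numpy as np

def lago_fit(X, Y, sim, lam=0.1, mu=1.0, lr=1e-2, steps=500):
    """Jointly fit per-language maps W[l] with X[l] @ W[l] ~ Y[l], adding
    Frobenius regularization and a graph penalty that pulls the maps of
    similar languages together (sim[i][j] in [0, 1] is language similarity)."""
    L = len(X)
    d_in, d_out = X[0].shape[1], Y[0].shape[1]
    W = [np.zeros((d_in, d_out)) for _ in range(L)]
    for _ in range(steps):
        for i in range(L):                         # sequential (Gauss-Seidel) updates
            grad = 2 * X[i].T @ (X[i] @ W[i] - Y[i]) + 2 * lam * W[i]
            for j in range(L):
                if j != i:
                    grad += 2 * mu * sim[i][j] * (W[i] - W[j])
            W[i] = W[i] - lr * grad / len(X[i])
    return W

rng = np.random.default_rng(0)
W_true = rng.standard_normal((8, 4))
X = [rng.standard_normal((10, 8)) for _ in range(3)]            # 10 samples/language
Y = [x @ (W_true + 0.1 * rng.standard_normal((8, 4))) for x in X]
sim = np.array([[0, 0.9, 0.2], [0.9, 0, 0.2], [0.2, 0.2, 0]])   # two close languages
W = lago_fit(X, Y, sim)
print(np.linalg.norm(W[0] - W[1]), np.linalg.norm(W[0] - W[2]))
```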
https://arxiv.org/abs/2505.16008
The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behaviors, existing theoretical frameworks remain fragmented and lack a unified mathematical lens for explaining emergent phenomena. We establish the first formal connection between LLM architectures and Algorithmic Information Theory (AIT) by proving two fundamental results: (1) the training process computationally approximates the Solomonoff prior, with loss minimization interpreted as program-length optimization, and (2) next-token prediction implements approximate Solomonoff induction. We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Furthermore, our theoretical insights lead to a principled method for few-shot example selection that prioritizes samples on which models exhibit lower predictive confidence. We demonstrate through experiments on diverse text classification benchmarks that this strategy yields significant performance improvements, particularly for smaller model architectures, compared to selecting high-confidence examples. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.
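A minimal sketch of the resulting selection rule: score each candidate demonstration by the model's predictive confidence on its gold label (stubbed here with a lookup table, an assumption standing in for an actual model call) and keep the least confident ones as few-shot examples.

```python
import numpy as np

def select_low_confidence_demos(candidates, confidence, k=8):
    """Keep the k candidate (input, label) pairs on which the model is least
    confident; these serve as the few-shot demonstrations."""
    scored = sorted(candidates, key=lambda ex: confidence(ex))
    return scored[:k]

# Stub confidence: probability the model assigns to the gold label (hypothetical).
rng = np.random.default_rng(0)
candidates = [(f"text {i}", "label") for i in range(50)]
conf_table = {ex: float(p) for ex, p in zip(candidates, rng.uniform(0.3, 1.0, 50))}
demos = select_low_confidence_demos(candidates, conf_table.get, k=5)
print([round(conf_table[d], 2) for d in demos])   # lowest-confidence examples first
```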
https://arxiv.org/abs/2505.15784
Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (LLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.
https://arxiv.org/abs/2505.15685
Recently, Vision-Language foundation models like CLIP and ALIGN, which are pre-trained on large-scale data, have shown remarkable zero-shot generalization to diverse datasets with different classes and even domains. In this work, we take a step further and analyze whether these models can be adapted to target datasets having very different distributions and classes compared to what they were trained on, using only a few labeled examples from the target dataset. In such scenarios, finetuning large pretrained models is challenging due to overfitting and loss of generalization, and has not been well explored in prior literature. Since the pre-training data of such models are unavailable, it is difficult to understand their performance on various downstream datasets. First, by analyzing the common vision-language embedding space, we try to answer the question: given a target dataset with a few labelled examples, can we estimate whether further fine-tuning will enhance performance compared to zero-shot evaluation? Based on this analysis, we propose a novel prompt-tuning method, PromptMargin, for adapting such large-scale VLMs directly on the few target samples. PromptMargin effectively tunes both the text and visual prompts for this task and has two main modules: 1) a selective augmentation strategy that complements the few training samples in each task; 2) a novel Multimodal Margin Regularizer that increases the inter-class margin for improved class discrimination, ensuring robust training in the presence of unfamiliar class names. Extensive experiments and analysis across fifteen target benchmark datasets, with varying degrees of distribution shift from natural images, show the effectiveness of the proposed framework over existing state-of-the-art approaches applied to this setting. Code: this http URL.
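A minimal numpy sketch of a margin-style regularizer over class prompt embeddings, offered as one plausible reading of the Multimodal Margin Regularizer rather than its exact formulation: a hinge penalty whenever two class prototypes fall within a cosine-distance margin.

```python
import numpy as np

def multimodal_margin_regularizer(class_embs, margin=0.5):
    """Penalize pairs of class prototype embeddings whose cosine distance is
    below `margin`, pushing classes apart for better discrimination."""
    z = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    cos = z @ z.T
    dist = 1.0 - cos                                  # pairwise cosine distance
    n = len(z)
    iu = np.triu_indices(n, k=1)                      # unique class pairs
    return float(np.maximum(0.0, margin - dist[iu]).mean())

rng = np.random.default_rng(0)
class_embs = rng.standard_normal((10, 512))           # e.g. text-prompt features
print(multimodal_margin_regularizer(class_embs))      # added to the few-shot loss
```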
https://arxiv.org/abs/2505.15506
Deep learning has advanced computational pathology, but expert annotations remain scarce. Few-shot learning mitigates the annotation burden yet suffers from overfitting and mischaracterization of discriminative features. In addition, current few-shot multiple instance learning (MIL) approaches leverage pretrained vision-language models to alleviate these issues, but at the cost of complex preprocessing and high computational overhead. To address these challenges, we propose the Squeeze-and-Recalibrate (SR) block, a drop-in replacement for linear layers in MIL models. The SR block comprises two core components: a pair of low-rank trainable matrices (the squeeze pathway, SP) that reduces the parameter count and imposes a bottleneck to prevent spurious feature learning, and a frozen random recalibration matrix that preserves geometric structure, diversifies feature directions, and redefines the optimization objective for the SP. We provide theoretical guarantees that the SR block can approximate any linear mapping to arbitrary precision, thereby ensuring that the performance of a standard MIL model serves as a lower bound for its SR-enhanced counterpart. Extensive experiments demonstrate that our SR-MIL models consistently outperform prior methods while requiring significantly fewer parameters and no architectural changes.
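A minimal numpy forward-pass sketch of such a block; the additive combination of the squeeze pathway with the frozen recalibration matrix is an assumption about how the two components are composed, and only A and B would receive gradients in training.

```python
import numpy as np

class SRBlock:
    """Drop-in stand-in for a dense (d_in x d_out) linear layer:
    a trainable low-rank squeeze pathway A @ B plus a frozen random
    recalibration matrix R (only A and B would be trained)."""
    def __init__(self, d_in, d_out, rank=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = 0.01 * rng.standard_normal((d_in, rank))             # trainable
        self.B = 0.01 * rng.standard_normal((rank, d_out))            # trainable
        self.R = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)   # frozen

    def __call__(self, x):
        return x @ (self.A @ self.B + self.R)

d_in, d_out, rank = 1024, 512, 16
dense_params = d_in * d_out
sr_params = rank * (d_in + d_out)              # only the squeeze pathway is trained
print(dense_params, sr_params)                 # 524288 vs 24576 trainable parameters
block = SRBlock(d_in, d_out, rank)
print(block(np.random.default_rng(1).standard_normal((4, d_in))).shape)  # (4, 512)
```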
https://arxiv.org/abs/2505.15504
Medical Vision-Language Models (MVLMs) have achieved excellent generalization in medical image analysis, yet their performance under noisy, corrupted conditions remains largely untested. Clinical imaging is inherently susceptible to acquisition artifacts and noise; however, existing evaluations predominantly use clean datasets, overlooking robustness -- i.e., the model's ability to perform under real-world distortions. To address this gap, we first introduce MediMeta-C, a corruption benchmark that systematically applies several perturbations across multiple medical imaging datasets. Combined with MedMNIST-C, this establishes a comprehensive robustness evaluation framework for MVLMs. We further propose RobustMedCLIP, a visual-encoder adaptation of a pretrained MVLM that incorporates few-shot tuning to enhance resilience against corruptions. Through extensive experiments, we benchmark 5 major MVLMs across 5 medical imaging modalities, revealing that existing models exhibit severe degradation under corruption and struggle with domain-modality tradeoffs. Our findings highlight the necessity of diverse training and robust adaptation strategies, demonstrating that efficient low-rank adaptation, when paired with few-shot tuning, improves robustness while preserving generalization across modalities.
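A minimal sketch of how a severity-graded corruption split can be generated; the corruption shown (Gaussian noise) and its severity parameters are illustrative assumptions, not MediMeta-C's actual definitions.

```python
import numpy as np

def gaussian_noise(img, severity=1):
    """Add zero-mean Gaussian noise; severity 1-5 maps to an increasing sigma
    (the sigmas here are illustrative, not the benchmark's actual values)."""
    sigma = [0.04, 0.08, 0.12, 0.18, 0.26][severity - 1]
    noisy = img + np.random.default_rng(0).normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 1.0)

def corrupt_dataset(images, corruptions, severities=(1, 2, 3, 4, 5)):
    """Yield (corruption name, severity, corrupted image) for every combination."""
    for img in images:
        for name, fn in corruptions.items():
            for s in severities:
                yield name, s, fn(img, s)

images = [np.random.default_rng(i).uniform(0, 1, (64, 64)) for i in range(2)]
for name, s, img in corrupt_dataset(images, {"gaussian_noise": gaussian_noise},
                                    severities=(1, 3, 5)):
    print(name, s, round(float(img.std()), 3))
```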
https://arxiv.org/abs/2505.15425
Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce \textbf{AgentThink}, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: \textbf{(i) Structured Data Generation}, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; \textbf{(ii) A Two-stage Training Pipeline}, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and \textbf{(iii) Agent-style Tool-Usage Evaluation}, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by \textbf{53.91\%} and enhances answer accuracy by \textbf{33.54\%}, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.
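A minimal sketch of agent-style tool invocation interleaved with reasoning, with a scripted stand-in for the tool-aware VLM and a one-entry toy tool library (both hypothetical); the training side (SFT plus GRPO) is not shown.

```python
import json

def run_with_tools(question, model, tools, max_steps=5):
    """Interleave model reasoning with tool calls: whenever the model emits a
    JSON tool request, execute it and feed the observation back into the context."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model("\n".join(transcript))
        transcript.append(step)
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip(), transcript
        if step.startswith("TOOL:"):
            call = json.loads(step[len("TOOL:"):])
            obs = tools[call["name"]](**call["args"])
            transcript.append(f"OBSERVATION: {obs}")
    return None, transcript

# Toy stand-ins for a driving tool library and a tool-aware VLM.
tools = {"detect_objects": lambda region: ["pedestrian", "traffic_light:red"]}
script = iter(['TOOL: {"name": "detect_objects", "args": {"region": "front"}}',
               "FINAL: stop, the light is red and a pedestrian is crossing"])
model = lambda prompt: next(script)
print(run_with_tools("What should the ego vehicle do?", model, tools))
```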
https://arxiv.org/abs/2505.15298
Despite progress in geometry-aware domain adaptation, current methods such as GAMA still suffer from two unresolved issues: (1) insufficient disentanglement of task-relevant and task-irrelevant manifold dimensions, and (2) rigid perturbation schemes that ignore per-class alignment asymmetries. To address this, we propose GAMA++, a novel framework that introduces (i) latent space disentanglement to isolate label-consistent manifold directions from nuisance factors, and (ii) an adaptive contrastive perturbation strategy that tailors both on- and off-manifold exploration to class-specific manifold curvature and alignment discrepancy. We further propose a cross-domain contrastive consistency loss that encourages local semantic clusters to align while preserving intra-domain diversity. Our method achieves state-of-the-art results on DomainNet, Office-Home, and VisDA benchmarks under both standard and few-shot settings, with notable improvements in class-level alignment fidelity and boundary robustness. GAMA++ sets a new standard for semantic geometry alignment in transfer learning.
https://arxiv.org/abs/2505.15241
Domain adaptation remains a challenge when there is significant manifold discrepancy between source and target domains. Although recent methods leverage manifold-aware adversarial perturbations to perform data augmentation, they often neglect precise manifold alignment and systematic exploration of structured perturbations. To address this, we propose GAMA (Geometry-Aware Manifold Alignment), a structured framework that achieves explicit manifold alignment via adversarial perturbation guided by geometric information. GAMA systematically employs tangent space exploration and manifold-constrained adversarial optimization, simultaneously enhancing semantic consistency, robustness to off-manifold deviations, and cross-domain alignment. Theoretical analysis shows that GAMA tightens the generalization bound via structured regularization and explicit alignment. Empirical results on DomainNet, VisDA, and Office-Home demonstrate that GAMA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, exhibiting superior robustness, generalization, and manifold alignment capability.
https://arxiv.org/abs/2505.15194
Transfer learning under domain shift remains a fundamental challenge due to the divergence between source and target data manifolds. In this paper, we propose MAADA (Manifold-Aware Adversarial Data Augmentation), a novel framework that decomposes adversarial perturbations into on-manifold and off-manifold components to simultaneously capture semantic variation and model brittleness. We theoretically demonstrate that enforcing on-manifold consistency reduces hypothesis complexity and improves generalization, while off-manifold regularization smooths decision boundaries in low-density regions. Moreover, we introduce a geometry-aware alignment loss that minimizes geodesic discrepancy between source and target manifolds. Experiments on DomainNet, VisDA, and Office-Home show that MAADA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.
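A minimal numpy sketch of splitting a perturbation into on- and off-manifold components by projecting onto a local tangent basis estimated from nearby samples via PCA; this is a simplification of the paper's construction, shown to make the decomposition concrete.

```python
import numpy as np

def tangent_basis(x, neighbors, dim=2):
    """Estimate a local tangent basis at x via PCA of its neighborhood."""
    centered = neighbors - neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim]                                  # (dim, D) orthonormal rows

def decompose_perturbation(delta, basis):
    """Split delta into on-manifold (tangent) and off-manifold (normal) parts."""
    on = basis.T @ (basis @ delta)
    return on, delta - on

rng = np.random.default_rng(0)
D = 16
neighbors = rng.standard_normal((20, D))             # nearby source samples
x = neighbors[0]
delta = 0.1 * rng.standard_normal(D)                 # adversarial perturbation
basis = tangent_basis(x, neighbors, dim=3)
on, off = decompose_perturbation(delta, basis)
print(round(float(np.linalg.norm(on)), 3), round(float(np.linalg.norm(off)), 3))
```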
https://arxiv.org/abs/2505.15191
While large language models (LLMs) have shown great potential across various domains, their applications in robotics remain largely limited to static, prompt-based behaviors and still face challenges in handling complex tasks under zero-shot or few-shot settings. Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental research question: Can LLMs be empowered with metacognitive capabilities to reason, reflect, and create, thereby enhancing their ability to perform robotic tasks with minimal demonstrations? In this paper, we present an early-stage framework that integrates metacognitive learning into LLM-powered multi-robot collaboration. The proposed framework equips the LLM-powered robotic agents with a skill decomposition and self-reflection mechanism that identifies modular skills from prior tasks, reflects on failures in unseen task scenarios, and synthesizes effective new solutions. Experimental results show that our metacognitive-learning-empowered LLM framework significantly outperforms existing baselines. Moreover, we observe that the framework is capable of generating solutions that differ from the ground truth yet still successfully complete the tasks. These exciting findings support our hypothesis that metacognitive learning can foster creativity in robotic planning.
https://arxiv.org/abs/2505.14899
Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On classification, it surpasses APE by 2.58% and EvoPrompt by 1.00%. Additionally, it improves average ROUGE scores on summarization by 4.32 over APE and 2.12 over EvoPrompt, and the SARI score on simplification by 6.93 over APE and 6.01 over EvoPrompt. Our code is available at this https URL.
https://arxiv.org/abs/2505.14412