Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. Thus, they cannot produce multiple proposals, support human interaction, or adapt across modalities. Recently, text-to-image diffusion models have shown potential to bridge this gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, that steers a pre-trained diffusion model to output segmentation masks. The framework extends naturally to a multi-class setting simply by prompting for the target organ. Our experiments on organ segmentation from CT images demonstrate strong performance compared with previous methods, and the approach could greatly benefit from an expert-in-the-loop setting that leverages multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred to MR image segmentation through low-rank, few-shot adaptation.
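The low-rank, few-shot adaptation mentioned above follows the same arithmetic as LoRA-style updates: a frozen weight matrix is augmented with a trainable low-rank product. A minimal numpy sketch of that idea (our illustration with made-up dimensions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 256, 256, 4

W = rng.standard_normal((d_out, d_in))        # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection (zero init)

def adapted_forward(x):
    # Low-rank adaptation: y = (W + B @ A) x, with only A and B trainable.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
y = adapted_forward(x)

frozen_params = W.size
trainable_params = A.size + B.size
print(trainable_params / frozen_params)  # a small fraction of the full matrix
```

With zero-initialized B, the adapted model starts out identical to the frozen one, which is what makes this kind of transfer stable in few-shot regimes.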
https://arxiv.org/abs/2601.16060
Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving highly competitive accuracies of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: this https URL.
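The channel-wise feature interpolation is described only at a high level in the abstract; one plausible reading, sketched below with made-up shapes, mixes the channels of two samples' latent feature maps with per-channel coefficients to synthesize a novel latent:

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 8, 4, 4                        # channels, height, width (illustrative)
feat_a = rng.standard_normal((C, H, W))  # latent features of sample A
feat_b = rng.standard_normal((C, H, W))  # latent features of sample B

# Per-channel interpolation coefficients in [0, 1], broadcast over H x W.
alpha = rng.uniform(size=(C, 1, 1))
mixed = alpha * feat_a + (1.0 - alpha) * feat_b

print(mixed.shape)  # (8, 4, 4): each channel lies between the two inputs
```

The point of interpolating in feature space rather than pixel space is that intermediate points remain plausible latents, which is what lets a data-starved GAN see "new" training signal.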
https://arxiv.org/abs/2601.15681
The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) \textbf{Causal Detection} (identifying whether a text contains a causal link) and 2) \textbf{Causal Extraction} (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, achieved a mean score of only 49.57\% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12\% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. \href{this https URL}{Code available here: this https URL}
https://arxiv.org/abs/2601.15479
Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical reasoning being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.
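The "explicit numeral mapping" the authors mention can be as simple as folding any Unicode decimal digit to its ASCII value before prompting. A sketch (our illustration, not the paper's code), using the standard-library `unicodedata` module:

```python
import unicodedata

def normalize_digits(text: str) -> str:
    """Map decimal digits from any Unicode script to ASCII 0-9."""
    out = []
    for ch in text:
        d = unicodedata.digit(ch, None)  # digit value, or None if not a digit
        out.append(str(d) if d is not None else ch)
    return "".join(out)

print(normalize_digits("٤٢"))  # Arabic-Indic digits  -> "42"
print(normalize_digits("४२"))  # Devanagari digits    -> "42"
```

Applying such a mapping as a preprocessing step turns an underrepresented numeral script into the format dominant in training corpora, which is exactly the gap the paper measures.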
https://arxiv.org/abs/2601.15251
Culverts and sewer pipes are critical components of drainage systems, and their failure can pose serious risks to public safety and the environment. In this thesis, we explore methods to improve automated defect segmentation in culverts and sewer pipes. Collecting and annotating data in this field is cumbersome and requires domain knowledge, so a large dataset for structural defect detection is not feasible. Our proposed methods are tested under conditions with limited annotated data to demonstrate applicability to real-world scenarios. Overall, this thesis proposes three methods to significantly enhance defect segmentation and handle data scarcity, either by enhancing the training data or by adjusting a model's architecture. First, we evaluate preprocessing strategies, including traditional data augmentation and dynamic label injection. These techniques significantly improve segmentation performance, increasing both Intersection over Union (IoU) and F1 score. Second, we introduce FORTRESS, a novel architecture that combines depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms. FORTRESS achieves state-of-the-art performance on the culvert and sewer pipe defect dataset while significantly reducing the number of trainable parameters as well as the computational cost. Finally, we investigate few-shot semantic segmentation and its applicability to defect detection. Few-shot learning aims to train models with only limited data available. By employing a bidirectional prototypical network with attention mechanisms, the model obtains richer feature representations and achieves satisfactory results across evaluation metrics.
https://arxiv.org/abs/2601.15366
CLIP-based foreground-background (FG-BG) decomposition methods have demonstrated remarkable effectiveness in improving few-shot out-of-distribution (OOD) detection performance. However, existing approaches still suffer from several limitations. For background regions obtained from decomposition, existing methods adopt a uniform suppression strategy for all patches, overlooking the varying contributions of different patches to the prediction. For foreground regions, existing methods fail to adequately consider that some local patches may exhibit appearance or semantic similarity to other classes, which may mislead the training process. To address these issues, we propose a new plug-and-play framework. This framework consists of three core components: (1) a Foreground-Background Decomposition module, which follows previous FG-BG methods to separate an image into foreground and background regions; (2) an Adaptive Background Suppression module, which adaptively weights patch classification entropy; and (3) a Confusable Foreground Rectification module, which identifies and rectifies confusable foreground patches. Extensive experimental results demonstrate that the proposed plug-and-play framework significantly improves the performance of existing FG-BG decomposition methods. Code is available at: this https URL.
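The adaptive background suppression above weights patches by their classification entropy rather than suppressing all of them uniformly. The exact weighting is not given in the abstract, so the sketch below (numpy, illustrative) just computes per-patch entropy over class probabilities and normalizes it into weights:

```python
import numpy as np

def patch_entropy_weights(logits):
    """logits: (num_patches, num_classes) -> per-patch weights summing to 1."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    h = -(p * np.log(p + 1e-12)).sum(axis=1)         # per-patch entropy
    return h / h.sum()                               # normalized weights

logits = np.array([[5.0, 0.0, 0.0],    # confident patch -> low entropy
                   [1.0, 1.0, 1.0]])   # uncertain patch -> max entropy log(3)
w = patch_entropy_weights(logits)
print(w)  # the uniform patch receives the larger weight
```

Whether high-entropy patches should be up- or down-weighted is a design choice of the actual method; the sketch only shows the entropy computation that any such weighting builds on.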
https://arxiv.org/abs/2601.15065
Optimizing the advertiser's cumulative value of winning impressions under budget constraints poses a complex challenge in online advertising, under the paradigm of AI-Generated Bidding (AIGB). Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. Large Language Models (LLMs) offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data. However, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that enhances both reasoning and numerical precision by dynamically updating the reference policy during training. Built upon this foundation, we further propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans using feedback-driven reasoning. This separation allows DARA to combine LLMs' in-context learning strengths with precise adaptability required by AIGB tasks. Extensive experiments on both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in terms of cumulative advertiser value under budget constraints.
https://arxiv.org/abs/2601.14711
Offline black-box optimization (BBO) aims to find optimal designs based solely on an offline dataset of designs and their labels. Such scenarios frequently arise in domains like DNA sequence design and robotics, where only a few labeled data points are available. Traditional methods typically rely on task-specific proxy or generative models, overlooking the in-context learning capabilities of pre-trained large language models (LLMs). Recent efforts have adapted autoregressive LLMs to BBO by framing task descriptions and offline datasets as natural language prompts, enabling direct design generation. However, these designs often contain bidirectional dependencies, which left-to-right models struggle to capture. In this paper, we explore diffusion LLMs for BBO, leveraging their bidirectional modeling and iterative refinement capabilities. This motivates our in-context denoising module: we condition the diffusion LLM on the task description and the offline dataset, both formatted in natural language, and prompt it to denoise masked designs into improved candidates. To guide the generation toward high-performing designs, we introduce masked diffusion tree search, which casts the denoising process as a step-wise Monte Carlo Tree Search that dynamically balances exploration and exploitation. Each node represents a partially masked design, each denoising step is an action, and candidates are evaluated via expected improvement under a Gaussian Process trained on the offline dataset. Our method, dLLM, achieves state-of-the-art results in few-shot settings on design-bench.
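The node-evaluation step above scores candidates by expected improvement (EI) under a Gaussian Process posterior. For a maximization objective with posterior mean $\mu$, standard deviation $\sigma$, and incumbent best $f^*$, the standard closed form (general BBO practice, not specific to this paper) is $\mathrm{EI} = (\mu - f^*)\,\Phi(z) + \sigma\,\phi(z)$ with $z = (\mu - f^*)/\sigma$:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for maximization under a Gaussian posterior."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return (mu - f_best) * cdf + sigma * pdf

print(expected_improvement(0.0, 1.0, 0.0))  # phi(0) ≈ 0.3989
```

EI is what gives the tree search its exploration/exploitation balance: a node can score well either through a high predicted mean or through high posterior uncertainty.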
https://arxiv.org/abs/2601.14446
We study sentence-level identification of the 19 values in the Schwartz motivational continuum as a concrete formulation of human value detection in text. The setting - out-of-context sentences from news and political manifestos - features sparse moral cues and severe class imbalance. This combination makes fine-grained sentence-level value detection intrinsically difficult, even for strong modern neural models. We first operationalize a binary moral presence task ("does any value appear?") and show that it is learnable from single sentences (positive-class F1 $\approx$ 0.74 with calibrated thresholds). We then compare a presence-gated hierarchy to a direct multi-label classifier under matched compute, both based on DeBERTa-base and augmented with lightweight signals (prior-sentence context, LIWC-22/eMFD/MJD lexica, and topic features). The hierarchy does not outperform direct prediction, indicating that gate recall limits downstream gains. We also benchmark instruction-tuned LLMs - Gemma 2 9B, Llama 3.1 8B, Mistral 8B, and Qwen 2.5 7B - in zero-/few-shot and QLoRA setups and build simple ensembles; a soft-vote supervised ensemble reaches macro-F1 0.332, significantly surpassing the best single supervised model and exceeding prior English-only baselines. Overall, in this scenario, lightweight signals and small ensembles yield the most reliable improvements, while hierarchical gating offers limited benefit. We argue that, under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain a strong and compute-efficient baseline for structured human value detection, and we outline how richer value structure and sentence-in-document context could further improve performance.
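The soft-vote supervised ensemble is, in essence, an average of the member models' per-label probabilities followed by calibrated thresholds. A toy sketch (our shapes and threshold values, not the paper's):

```python
import numpy as np

# Per-label probabilities from three hypothetical member models
# for one sentence over four value classes.
probs = np.array([[0.9, 0.2, 0.4, 0.1],
                  [0.7, 0.1, 0.6, 0.2],
                  [0.8, 0.3, 0.5, 0.1]])

soft_vote = probs.mean(axis=0)               # average probabilities, not hard votes
thresholds = np.array([0.5, 0.5, 0.4, 0.5])  # per-label calibrated thresholds
labels = (soft_vote >= thresholds).astype(int)

print(soft_vote)  # averaged scores per label
print(labels)     # [1 0 1 0]
```

Averaging probabilities (soft voting) preserves each model's confidence, which matters under the severe class imbalance the paper describes; hard majority voting would discard it.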
https://arxiv.org/abs/2601.14172
Few-shot learning aims to identify novel categories from only a handful of labeled samples, where prototypes estimated from scarce data are often biased and generalize poorly. Semantic-based methods alleviate this by introducing coarse class-level information, but they are mostly applied on the support side, leaving query representations unchanged. In this paper, we present PMCE, a Probabilistic few-shot framework that leverages Multi-granularity semantics with Caption-guided Enhancement. PMCE constructs a nonparametric knowledge bank that stores visual statistics for each category as well as CLIP-encoded class name embeddings of the base classes. At meta-test time, the most relevant base classes are retrieved based on the similarities of class name embeddings for each novel category. These statistics are then aggregated into category-specific prior information and fused with the support set prototypes via a simple MAP update. Simultaneously, a frozen BLIP captioner provides label-free instance-level image descriptions, and a lightweight enhancer trained on base classes optimizes both support prototypes and query features under an inductive protocol with a consistency regularization to stabilize noisy captions. Experiments on four benchmarks show that PMCE consistently improves over strong baselines, achieving up to 7.71% absolute gain over the strongest semantic competitor on MiniImageNet in the 1-shot setting. Our code is available at this https URL
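The "simple MAP update" that fuses retrieved base-class priors with the support prototype reduces, under a conjugate-Gaussian assumption, to precision-weighted shrinkage. The sketch below is one standard form of that update (illustrative hyperparameters of our choosing, not the paper's exact formula):

```python
import numpy as np

def map_prototype(support_mean, n_support, prior_mean, kappa):
    """Shrink the support prototype toward the aggregated base-class prior.

    Conjugate-Gaussian MAP estimate of the class mean: with few shots the
    estimate leans on the prior; with many shots it trusts the support data.
    """
    return (n_support * support_mean + kappa * prior_mean) / (n_support + kappa)

prior = np.array([0.0, 0.0])    # category-specific prior from the knowledge bank
support = np.array([2.0, 2.0])  # mean of the (scarce) support embeddings

print(map_prototype(support, 1, prior, kappa=1.0))     # 1-shot: halfway to prior
print(map_prototype(support, 1000, prior, kappa=1.0))  # many shots: ~support mean
```

This is why prior fusion helps most in the 1-shot setting, where a raw prototype estimated from a single sample is the most biased.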
https://arxiv.org/abs/2601.14111
Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose \OURS, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.
https://arxiv.org/abs/2601.13837
Vision-Language Models (VLMs), particularly CLIP, have revolutionized anomaly detection by enabling zero-shot and few-shot defect identification without extensive labeled datasets. By learning aligned representations of images and text, VLMs facilitate anomaly classification and segmentation through natural language descriptions of normal and abnormal states, eliminating traditional requirements for task-specific training or defect examples. This project presents a comprehensive analysis of VLM-based approaches for anomaly classification (AC) and anomaly segmentation (AS). We systematically investigate key architectural paradigms including sliding window-based dense feature extraction (WinCLIP), multi-stage feature alignment with learnable projections (AprilLab framework), and compositional prompt ensemble strategies. Our analysis evaluates these methods across critical dimensions: feature extraction mechanisms, text-visual alignment strategies, prompt engineering techniques, zero-shot versus few-shot trade-offs, computational efficiency, and cross-domain generalization. Through rigorous experimentation on benchmarks such as MVTec AD and VisA, we compare classification accuracy, segmentation precision, and inference efficiency. The primary contribution is a foundational understanding of how and why VLMs succeed in anomaly detection, synthesizing practical insights for method selection and identifying current limitations. This work aims to facilitate informed adoption of VLM-based methods in industrial quality control and guide future research directions.
https://arxiv.org/abs/2601.13440
Learning in data-scarce settings has recently gained significant attention in the research community. Semi-supervised object detection (SSOD) aims to improve detection performance by leveraging a large number of unlabeled images alongside a limited number of labeled images (a.k.a. few-shot learning). In this paper, we present a comprehensive comparison of three state-of-the-art SSOD approaches, including MixPL, Semi-DETR, and Consistent-Teacher, with the goal of understanding how performance varies with the number of labeled images. We conduct experiments using the MS-COCO and Pascal VOC datasets, two popular object detection benchmarks which allow for standardized evaluation. In addition, we evaluate the SSOD approaches on a custom Beetle dataset, which enables us to gain insights into their performance on specialized datasets with a smaller number of object categories. Our findings highlight the trade-offs between accuracy, model size, and latency, providing insights into which methods are best suited for low-data regimes.
https://arxiv.org/abs/2601.13380
Crack detection is critical for concrete infrastructure safety, but real-world cracks often appear in low-light environments like tunnels and bridge undersides, degrading computer vision segmentation accuracy. Pixel-level annotation of low-light crack images is extremely time-consuming, yet most deep learning methods require large, well-illuminated datasets. We propose a dual-branch prototype learning network integrating Retinex theory with few-shot learning for low-light crack segmentation. Retinex-based reflectance components guide illumination-invariant global representation learning, while metric learning reduces dependence on large annotated datasets. We introduce a cross-similarity prior mask generation module that computes high-dimensional similarities between query and support features to capture crack location and structure, and a multi-scale feature enhancement module that fuses multi-scale features with the prior mask to alleviate spatial inconsistency. Extensive experiments on multiple benchmarks demonstrate consistent state-of-the-art performance under low-light conditions. Code: this https URL.
https://arxiv.org/abs/2601.13059
While Large Language Models (LLMs) produce highly nuanced text simplifications, developers currently lack tools for a holistic, efficient, and reproducible diagnosis of their behavior. This paper introduces the Simplification Profiler, a diagnostic toolkit that generates a multidimensional, interpretable fingerprint of simplified texts. Aggregating multiple simplifications from a model yields that model's fingerprint. This novel evaluation paradigm is particularly vital for languages where the data scarcity problem is magnified when creating flexible models for diverse target groups rather than a single, fixed simplification style. We propose that measuring a model's unique behavioral signature is more relevant in this context as an alternative to correlating metrics with human preferences. We operationalize this with a practical meta-evaluation of our fingerprints' descriptive power, which bypasses the need for large, human-rated datasets. This test measures whether a simple linear classifier can reliably identify various model configurations from their generated simplifications, confirming that our metrics are sensitive to a model's specific characteristics. The Profiler can distinguish high-level behavioral variations between prompting strategies and fine-grained changes from prompt engineering, including few-shot examples. Our complete feature set achieves classification F1-scores up to 71.9%, improving upon simple baselines by over 48 percentage points. The Simplification Profiler thus offers developers a granular, actionable analysis to build more effective and truly adaptive text simplification systems.
https://arxiv.org/abs/2601.13050
Few-shot learning in remote sensing remains challenging due to three factors: the scarcity of labeled data, substantial domain shifts, and the multi-scale nature of geospatial objects. To address these issues, we introduce Adaptive Multi-Scale Correlation Meta-Network (AMC-MetaNet), a lightweight yet powerful framework with three key innovations: (i) correlation-guided feature pyramids for capturing scale-invariant patterns, (ii) an adaptive channel correlation module (ACCM) for learning dynamic cross-scale relationships, and (iii) correlation-guided meta-learning that leverages correlation patterns instead of conventional prototype averaging. Unlike prior approaches that rely on heavy pre-trained models or transformers, AMC-MetaNet is trained from scratch with only $\sim600K$ parameters, offering $20\times$ fewer parameters than ResNet-18 while maintaining high efficiency ($<50$ms per image inference). AMC-MetaNet achieves up to 86.65\% accuracy in 5-way 5-shot classification on various remote sensing datasets, including EuroSAT, NWPU-RESISC45, UC Merced Land Use, and AID. Our results establish AMC-MetaNet as a computationally efficient, scale-aware framework for real-world few-shot remote sensing.
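The $20\times$ parameter claim can be checked with quick arithmetic, taking the commonly cited ~11.7M parameter count for ResNet-18 (a figure from outside this abstract):

```python
resnet18_params = 11.7e6    # commonly cited ResNet-18 parameter count (assumption)
amc_metanet_params = 600e3  # ~600K, as stated in the abstract

ratio = resnet18_params / amc_metanet_params
print(round(ratio, 1))  # 19.5, consistent with the ~20x fewer-parameters claim
```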
https://arxiv.org/abs/2601.12308
The safety validation of autonomous robotic vehicles hinges on systematically testing their planning and control stacks against rare, safety-critical scenarios. Mining these long-tail events from massive real-world driving logs is therefore a critical step in the robotic development lifecycle. The goal of the Scenario Mining task is to retrieve useful information to enable targeted re-simulation, regression testing, and failure analysis of the robot's decision-making algorithms. RefAV, introduced by the Argoverse team, is an end-to-end framework that uses large language models (LLMs) to spatially and temporally localize scenarios described in natural language. However, this process performs retrieval on trajectory labels, ignoring the direct connection between natural language and raw RGB images, which runs counter to the intuition of video retrieval; it also depends on the quality of upstream 3D object detection and tracking, so inaccuracies in trajectory data propagate into downstream spatial and temporal localization. To address these issues, we propose Robust Scenario Mining for Robotic Autonomy from Coarse to Fine (SMc2f), a coarse-to-fine pipeline that (i) employs vision-language models (VLMs) for coarse image-text filtering, (ii) builds a database of successful mining cases on top of RefAV and automatically retrieves exemplars to few-shot condition the LLM for more robust retrieval, and (iii) introduces text-trajectory contrastive learning that pulls matched pairs together and pushes mismatched pairs apart in a shared embedding space, yielding a fine-grained matcher that refines the LLM's candidate trajectories. Experiments on public datasets demonstrate substantial gains in both retrieval quality and efficiency.
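The pull-matched/push-mismatched objective described above is the standard InfoNCE pattern. A hedged numpy sketch in one direction (text to trajectory), with stand-in embeddings and a temperature of our choosing, since the abstract does not specify the encoders:

```python
import numpy as np

def info_nce(text_emb, traj_emb, tau=0.1):
    """InfoNCE loss: matched (text, trajectory) pairs sit on the diagonal."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    r = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    sim = t @ r.T / tau                         # (N, N) similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))             # pull diagonal up, push rest down

emb = np.eye(4)                      # 4 perfectly matched pairs (toy embeddings)
aligned = info_nce(emb, emb)
shuffled = info_nce(emb, emb[::-1])  # mismatched pairing
print(aligned < shuffled)            # True: matched pairs give a lower loss
```

Minimizing this loss is what shapes the shared embedding space so that the fine-grained matcher can rank the LLM's candidate trajectories by text similarity.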
https://arxiv.org/abs/2601.12010
Handwriting movements can be leveraged as a unique form of behavioral biometrics, to verify whether a real user is operating a device or application. This task can be framed as a reverse Turing test in which a computer has to detect if an input instance has been generated by a human or artificially. To tackle this task, we study ten public datasets of handwritten symbols (isolated characters, digits, gestures, pointing traces, and signatures) that are artificially reproduced using seven different synthesizers, including, among others, the Kinematic Theory (Sigma-Lognormal model), generative adversarial networks, Transformers, and Diffusion models. We train a shallow recurrent neural network that achieves excellent performance (98.3 percent Area Under the ROC Curve (AUC) score and 1.4 percent equal error rate on average across all synthesizers and datasets) using nonfeaturized trajectory data as input. In few-shot settings, we show that our classifier maintains this excellent performance when trained on just 10 percent of the data, evaluated on the remaining 90 percent as a test set. We further challenge our classifier in out-of-domain settings, and observe very competitive results as well. Our work has implications for computerized systems that need to verify human presence, and adds an additional layer of security to keep attackers at bay.
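A shallow recurrent classifier over raw (non-featurized) trajectory data can be sketched as below. The specifics here, an Elman-style RNN with a 16-unit hidden state, random untrained weights, and a logistic output, are illustrative assumptions rather than the paper's actual network:

```python
import numpy as np

class TinyRNNClassifier:
    """Minimal Elman RNN + logistic head over raw (x, y) pen trajectories."""

    def __init__(self, in_dim=2, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.normal(0.0, 0.3, (in_dim, hidden))
        self.Wh = rng.normal(0.0, 0.3, (hidden, hidden))
        self.b = np.zeros(hidden)
        self.w_out = rng.normal(0.0, 0.3, hidden)

    def forward(self, traj):
        """traj: (T, 2) array of raw pen coordinates; returns P(human)."""
        h = np.zeros(self.Wh.shape[0])
        for x_t in traj:  # consume the trajectory one timestep at a time
            h = np.tanh(x_t @ self.Wx + h @ self.Wh + self.b)
        logit = float(h @ self.w_out)
        return 1.0 / (1.0 + np.exp(-logit))
```

Feeding coordinates directly into the recurrence, with no hand-crafted kinematic features, mirrors the "nonfeaturized trajectory data as input" setup in the abstract.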
https://arxiv.org/abs/2601.11700
Large language models (LLMs) are reshaping automated fact-checking (AFC) by enabling unified, end-to-end verification pipelines rather than isolated components. While large proprietary models achieve strong performance, their closed weights, complexity, and high costs limit sustainability. Fine-tuning smaller open-weight models for individual AFC tasks can help but requires multiple specialized models, resulting in high costs. We propose **multi-task learning (MTL)** as a more efficient alternative that fine-tunes a single model to perform claim detection, evidence ranking, and stance detection jointly. Using small decoder-only LLMs (e.g., Qwen3-4b), we explore three MTL strategies: classification heads, causal language modeling heads, and instruction-tuning, and evaluate them across model sizes, task orders, and standard non-LLM baselines. While multitask models do not universally surpass single-task baselines, they yield substantial improvements, achieving up to **44%**, **54%**, and **31%** relative gains for claim detection, evidence re-ranking, and stance detection, respectively, over zero-/few-shot settings. Finally, we also provide practical, empirically grounded guidelines to help practitioners apply MTL with LLMs for automated fact-checking.
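The classification-heads variant of MTL can be sketched as a shared encoder with one lightweight head per AFC task, trained on the sum of per-task cross-entropy losses. Everything below (the dimensions, the plain linear encoder, the class counts per task) is an illustrative assumption, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 32, 16                        # input and shared-representation sizes
TASKS = {"claim_detection": 2,       # task name -> number of classes
         "evidence_ranking": 2,
         "stance_detection": 3}

W_shared = rng.normal(0.0, 0.1, (D, H))                    # shared encoder
heads = {t: rng.normal(0.0, 0.1, (H, c)) for t, c in TASKS.items()}

def multitask_loss(x, labels):
    """Cross-entropy summed over tasks for one example.
    labels: task name -> gold class index; tasks absent from
    `labels` simply contribute nothing for that example."""
    h = np.tanh(x @ W_shared)                              # shared features
    total = 0.0
    for task, y in labels.items():
        logits = h @ heads[task]
        logits -= logits.max()                             # stability
        log_probs = logits - np.log(np.exp(logits).sum())
        total += -log_probs[y]
    return float(total)
```

One gradient step on this summed loss updates the shared encoder with signal from every task at once, which is the efficiency argument for MTL over maintaining three specialized models.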
https://arxiv.org/abs/2601.11293
The rapid evolution of satellite-borne Earth Observation (EO) systems has revolutionized terrestrial monitoring, yielding petabyte-scale archives. However, the immense computational and storage requirements for global-scale analysis often preclude widespread use, hindering planetary-scale studies. To address these barriers, we present Embedded Seamless Data (ESD), an ultra-lightweight, 30-m global Earth embedding database spanning the 25-year period from 2000 to 2024. By transforming high-dimensional, multi-sensor observations from the Landsat series (5, 7, 8, and 9) and MODIS Terra into information-dense, quantized latent vectors, ESD distills essential geophysical and semantic features into a unified latent space. Utilizing the ESDNet architecture and Finite Scalar Quantization (FSQ), the dataset achieves a transformative ~340-fold reduction in data volume compared to raw archives. This compression allows the entire global land surface for a single year to be encapsulated within approximately 2.4 TB, enabling decadal-scale global analysis on standard local workstations. Rigorous validation demonstrates high reconstructive fidelity (MAE: 0.0130; RMSE: 0.0179; CC: 0.8543). By condensing the annual phenological cycle into 12 temporal steps, the embeddings provide inherent denoising and a semantically organized space that outperforms raw reflectance in land-cover classification, achieving 79.74% accuracy (vs. 76.92% for raw fusion). With robust few-shot learning capabilities and longitudinal consistency, ESD provides a versatile foundation for democratizing planetary-scale research and advancing next-generation geospatial artificial intelligence.
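Finite Scalar Quantization (FSQ), used above to produce the quantized latent vectors, replaces a learned codebook with per-dimension rounding onto a small fixed grid. A minimal sketch for odd level counts (even level counts need an extra half-step offset, omitted here; the paper's actual level configuration is not stated in the abstract):

```python
import numpy as np

def fsq(z, levels):
    """Bound each latent dimension with tanh, then round it onto a grid
    of `levels[d]` values in [-1, 1]; the implicit codebook size is
    the product of the per-dimension level counts."""
    half = (np.asarray(levels) - 1) / 2.0  # e.g. 5 levels -> grid {-2..2}
    bounded = np.tanh(z) * half            # each dim now in (-half, half)
    return np.round(bounded) / half        # snap to grid, rescale to [-1, 1]
```

For example, levels = [5, 5, 5] yields an implicit codebook of 125 entries with no learned embedding table to collapse, and each quantized dimension can be stored in a few bits, which is the kind of discretization that enables the ~340-fold volume reduction described above.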
https://arxiv.org/abs/2601.11183