Underwater acoustic target recognition (UATR) is of great significance for the protection of marine biodiversity and for national defense security. The development of deep learning provides new opportunities for UATR, but the task still faces challenges from scarce reference samples and complex environmental interference. To address these issues, we propose a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The method integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize the target classification and feature reconstruction tasks. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise. Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97\% classification accuracy and a 95\% $F1$-score in 27-class few-shot scenarios, significantly outperforming traditional CNN and ACNN models as well as popular state-of-the-art UATR methods. Ablation studies confirm the synergistic benefits of multi-task learning and attention mechanisms, while a dynamic weighting adjustment strategy effectively balances task contributions. This work provides an efficient solution for few-shot underwater acoustic recognition, advancing research in marine bioacoustics and sonar signal processing.
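The abstract gives no implementation details, so the following PyTorch sketch is only one plausible reading of the described architecture: a squeeze-and-excitation-style channel attention block inside a shared convolutional extractor, with a classification head and a spectrogram-reconstruction head optimized jointly. All layer sizes, the input resolution, and the 0.5 loss weight are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate that re-weights feature channels."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.gate(x).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return x * w                                   # emphasize informative channels

class MultiTaskAttentionCNN(nn.Module):
    """Shared extractor with a classification head and a reconstruction head."""
    def __init__(self, n_classes: int = 27):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            ChannelAttention(32),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            ChannelAttention(64),
            nn.AdaptiveAvgPool2d(8),
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(64 * 8 * 8, n_classes))
        self.decoder = nn.Sequential(                  # reconstructs the input spectrogram
            nn.Upsample(size=(64, 64)), nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.decoder(z)

# Joint loss with a tunable balance between the two tasks (the 0.5 weight is a placeholder).
model = MultiTaskAttentionCNN()
spec = torch.randn(4, 1, 64, 64)           # batch of log-mel spectrograms
labels = torch.randint(0, 27, (4,))
logits, recon = model(spec)
loss = nn.CrossEntropyLoss()(logits, labels) + 0.5 * nn.MSELoss()(recon, spec)
loss.backward()
```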
https://arxiv.org/abs/2504.13102
Mass-shooting events pose a significant challenge to public safety, generating large volumes of unstructured textual data that hinder effective investigations and the formulation of public policy. Despite the urgency, few prior studies have effectively automated the extraction of key information from these events to support legal and investigative efforts. This paper presents the first dataset designed for knowledge acquisition on mass-shooting events through the application of named entity recognition (NER) techniques. It focuses on identifying key entities, such as offenders, victims, locations, and criminal instruments, that are vital for legal and investigative purposes. The NER process is powered by Large Language Models (LLMs) using few-shot prompting, facilitating the efficient extraction and organization of critical information from diverse sources, including news articles, police reports, and social media. Experimental results on real-world mass-shooting corpora demonstrate that GPT-4o is the most effective model for mass-shooting NER, achieving the highest Micro Precision, Micro Recall, and Micro F1-scores. Meanwhile, o1-mini delivers competitive performance, making it a resource-efficient alternative for less complex NER tasks. We also observe that increasing the shot count enhances the performance of all models, but the gains are more substantial for GPT-4o and o1-mini, highlighting their superior adaptability to few-shot learning scenarios.
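The abstract does not specify the prompt format, entity schema, or model interface, so the sketch below is only an assumed illustration of how a few-shot NER prompt of this kind could be assembled; the entity types and the demonstration example are hypothetical.

```python
import json

ENTITY_TYPES = ["OFFENDER", "VICTIM", "LOCATION", "WEAPON"]  # assumed schema

FEW_SHOT_EXAMPLES = [  # hypothetical, hand-labeled demonstration
    {
        "text": "Police say the gunman opened fire at a mall in Springfield, injuring two shoppers.",
        "entities": [
            {"type": "OFFENDER", "span": "the gunman"},
            {"type": "VICTIM", "span": "two shoppers"},
            {"type": "LOCATION", "span": "a mall in Springfield"},
        ],
    },
]

def build_ner_prompt(report: str, shots: list) -> str:
    """Compose a few-shot prompt asking an LLM to return entities as JSON."""
    lines = [
        "Extract the following entity types from the report: " + ", ".join(ENTITY_TYPES) + ".",
        "Answer with a JSON list of {\"type\": ..., \"span\": ...} objects.",
        "",
    ]
    for ex in shots:  # the study found that raising the shot count generally helps
        lines.append("Report: " + ex["text"])
        lines.append("Entities: " + json.dumps(ex["entities"]))
        lines.append("")
    lines.append("Report: " + report)
    lines.append("Entities:")
    return "\n".join(lines)

prompt = build_ner_prompt("The suspect fled the scene near Oak Street after shooting a clerk.",
                          FEW_SHOT_EXAMPLES)
print(prompt)  # this string would then be sent to GPT-4o / o1-mini via the usual chat API
```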
https://arxiv.org/abs/2504.12545
In our paper we explore the definition and extrapolation of fallacies as they pertain to the automatic detection of manipulation on social media. In particular, we explore how these logical fallacies might appear in the real world, i.e., internet forums. We discovered a prevalence of misinformation and misguided intent in discussion boards centered on the Ukrainian-Russian conflict, which serves to narrow the domain of our task. Although automatic fallacy detection has gained attention recently, most datasets use unregulated fallacy taxonomies or are limited to formal linguistic domains like political debates or news reports. Online discourse, however, often features non-standardized and diverse language not captured in these domains. We present Shady Linguistic Utterance Replication-Generation (SLURG) to address these limitations, exploring the feasibility of generating synthetic fallacious forum-style comments using large language models (LLMs), specifically DeepHermes-3-Mistral-24B. Our findings indicate that LLMs can replicate the syntactic patterns of real data and that high-quality few-shot prompts enhance LLMs' ability to mimic the vocabulary diversity of online forums.
https://arxiv.org/abs/2504.12466
Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for \textit{local sparsity and global density}, which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for \textit{local randomness and global importance}, which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
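The abstract only names the two paradigms, so the following is a speculative sketch of what a single update step in that spirit might look like: a random mask keeps a small fraction of gradient entries (local randomness, local sparsity), while the first moment is pruned to its largest-magnitude entries (global importance); across iterations different entries move, so the model remains globally dense. All hyperparameters are placeholders.

```python
import torch

def sparse_update(param, grad, momentum, lr=1e-3, beta=0.9,
                  grad_keep=0.05, moment_keep=0.10):
    """One illustrative sparse-optimization step on a single parameter tensor."""
    # Local randomness: keep a random 5% of gradient entries, zero the rest.
    rand_mask = (torch.rand_like(grad) < grad_keep).float()
    sparse_grad = grad * rand_mask

    # Update the first moment with the sparsified gradient.
    momentum.mul_(beta).add_(sparse_grad, alpha=1 - beta)

    # Global importance: prune the moment to its top-10% largest-magnitude entries.
    k = max(1, int(moment_keep * momentum.numel()))
    threshold = momentum.abs().flatten().topk(k).values.min()
    momentum.mul_((momentum.abs() >= threshold).float())

    # Apply the (sparse) update.
    param.data.add_(momentum, alpha=-lr)

w = torch.randn(512, 512, requires_grad=True)
m = torch.zeros_like(w)
loss = (w ** 2).sum()
loss.backward()
sparse_update(w, w.grad, m)
```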
https://arxiv.org/abs/2504.12436
With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, our experiments show that CLIP's logits suffer from serious inter-class confusion in downstream tasks, and this ambiguity between categories severely degrades accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. Our MAF extracts features from different levels and fuses them uniformly to enhance feature representation. Our ICD uses a residual structure to learn to eliminate inter-class confusion in the logits. Experimental results show that our method can significantly improve classification performance and alleviate the inter-class confusion problem. The code is available at this https URL.
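The abstract describes ICD only as a learnable, residual correction of CLIP's logits; a minimal sketch of one such layer (the zero initialization and the placement after the raw CLIP similarities are assumptions) could look like this:

```python
import torch
import torch.nn as nn

class InterClassDeconfusion(nn.Module):
    """Learnable residual correction applied to per-class logits."""
    def __init__(self, n_classes: int):
        super().__init__()
        # Maps the raw logit vector to a correction term; initialized to zero so
        # training starts from the unmodified CLIP logits.
        self.correction = nn.Linear(n_classes, n_classes)
        nn.init.zeros_(self.correction.weight)
        nn.init.zeros_(self.correction.bias)

    def forward(self, logits):
        return logits + self.correction(logits)   # residual structure

clip_logits = torch.randn(8, 100)                 # e.g. CLIP similarities for 100 classes
deconfused = InterClassDeconfusion(100)(clip_logits)
```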
https://arxiv.org/abs/2504.12104
Given a single labeled example, in-context segmentation aims to segment corresponding objects. This setting, known as one-shot segmentation in few-shot learning, probes a segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method, based on prompt-tuning, to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance the features of SAM's prompt encoder for segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. We then design a cycle-consistent cross-attention over the fused features and the initial visual prompts. Next, a dual-branch design exploits discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to adapt the proposed dual-consistency method to the mask tube. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in-context segmentation in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at this https URL.
https://arxiv.org/abs/2504.12080
Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging task in industrial inspection, where normal-distribution modeling must be accomplished with only a few normal images. While existing approaches typically employ multi-modal foundation models combining language and vision modalities for prompt-guided anomaly detection, these methods often demand sophisticated prompt engineering and extensive manual tuning. In this paper, we demonstrate that a straightforward nearest-neighbor search framework can surpass state-of-the-art performance in both single-class and multi-class FSAD scenarios. Our proposed method, VisionAD, consists of four simple yet essential components: (1) scalable vision foundation models that extract universal and discriminative features; (2) dual augmentation strategies: support augmentation to enhance feature-matching adaptability and query augmentation to address the oversights of single-view prediction; (3) multi-layer feature integration that captures both low-frequency global context and high-frequency local details with minimal computational overhead; and (4) a class-aware visual memory bank enabling efficient one-for-all multi-class detection. Extensive evaluations across the MVTec-AD, VisA, and Real-IAD benchmarks demonstrate VisionAD's exceptional performance. Using only one normal image as support, our method achieves remarkable image-level AUROC scores of 97.4%, 94.8%, and 70.8%, respectively, outperforming current state-of-the-art approaches by significant margins (+1.6%, +3.2%, and +1.4%). The training-free nature and superior few-shot capabilities of VisionAD make it particularly appealing for real-world applications where samples are scarce or expensive to obtain. Code is available at this https URL.
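A toy version of the nearest-neighbor scoring with a class-aware memory bank is sketched below; feature extraction, augmentation, and multi-layer fusion are omitted, and the array shapes, class name, and scoring rule are illustrative assumptions rather than details from the paper.

```python
import numpy as np

class ClassAwareMemoryBank:
    """Stores normal-sample patch features per class and scores queries by 1-NN distance."""
    def __init__(self):
        self.bank = {}                                   # class name -> list of (N, D) arrays

    def add(self, cls: str, features: np.ndarray):
        self.bank.setdefault(cls, []).append(features)

    def score(self, cls: str, query: np.ndarray) -> np.ndarray:
        memory = np.concatenate(self.bank[cls], axis=0)  # (N, D)
        # Distance of every query patch to its nearest stored normal patch.
        d = np.linalg.norm(query[:, None, :] - memory[None, :, :], axis=-1)
        return d.min(axis=1)                             # high value = anomalous patch

bank = ClassAwareMemoryBank()
bank.add("screw", np.random.randn(200, 256))             # patches from the single support image
patch_scores = bank.score("screw", np.random.randn(196, 256))
image_score = patch_scores.max()                          # image-level anomaly score
```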
https://arxiv.org/abs/2504.11895
Despite the rapid evolution of learning and computer vision algorithms, Fine-Grained Classification (FGC) still poses an open problem in many practically relevant applications. In the retail domain, for example, the identification of fast-changing and visually highly similar products and their properties is key to automated price monitoring and product recommendation. This paper presents a novel Visual RAG pipeline that combines the Retrieval Augmented Generation (RAG) approach and Vision Language Models (VLMs) for few-shot FGC. The Visual RAG pipeline extracts product and promotion data from advertisement leaflets of various retailers and simultaneously predicts fine-grained product IDs along with price and discount information. Compared to previous approaches, the key characteristic of the Visual RAG pipeline is that it allows the prediction of novel products without re-training, simply by adding a few class samples to the RAG database. Comparing several VLM back-ends such as GPT-4o [23], GPT-4o-mini [24], and Gemini 2.0 Flash [10], our approach achieves 86.8% accuracy on a diverse dataset.
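As a bare-bones illustration of the retrieval step and the resulting VLM prompt, one might write something like the sketch below; the embedding model, database layout, product names, and prompt wording are all assumptions. The point it illustrates is that supporting a new product only requires adding rows to the database, with no retraining.

```python
import numpy as np

def top_k_products(query_emb, database, k=3):
    """database: list of (product_id, embedding) pairs for known product crops."""
    sims = [(pid, float(query_emb @ emb / (np.linalg.norm(query_emb) * np.linalg.norm(emb))))
            for pid, emb in database]
    return sorted(sims, key=lambda s: s[1], reverse=True)[:k]

def build_vlm_prompt(candidates):
    ids = ", ".join(pid for pid, _ in candidates)
    return (f"The attached leaflet crop shows one product. The most similar known products are: {ids}. "
            "Return the matching product id, the displayed price, and any discount as JSON.")

database = [("choc_bar_100g", np.random.randn(512)),    # adding a few samples per class is
            ("choc_bar_200g", np.random.randn(512))]    # enough to cover a novel product
query = np.random.randn(512)                             # embedding of the leaflet crop
print(build_vlm_prompt(top_k_products(query, database)))
```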
https://arxiv.org/abs/2504.11838
The rapid development of generative AI facilitates content creation and makes image manipulation both easier to perform and more difficult to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs to forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% on Autosplice and 86.3% on LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.
https://arxiv.org/abs/2504.11686
Instruct models, obtained from various instruction-tuning or post-training steps, are commonly deemed superior and more usable than their base counterparts. While the model gains instruction-following ability, instruction tuning may lead to forgetting knowledge from pre-training, or it may encourage the model to become overly conversational or verbose. This, in turn, can degrade in-context few-shot learning performance. In this work, we study the performance trajectory between base and instruct models by scaling down the strength of instruction tuning via the partial adaptation method. We show that, across several model families and model sizes, reducing the strength of instruction tuning results in material improvement on a few-shot in-context learning benchmark covering a variety of classic natural language tasks. This comes at the cost of losing some degree of instruction-following ability as measured by AlpacaEval. Our study sheds light on the potential trade-off between in-context learning and instruction-following abilities that is worth considering in practice.
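The abstract does not spell out the partial adaptation procedure; one natural reading is a weight-space interpolation between the base and instruct checkpoints, sketched below under that assumption (the linear form and the 0.5 coefficient are placeholders, not the paper's settings).

```python
import torch

def partially_adapt(base_state: dict, instruct_state: dict, alpha: float) -> dict:
    """Interpolate between base (alpha=0) and fully instruction-tuned (alpha=1) weights."""
    return {name: torch.lerp(base_state[name], instruct_state[name], alpha)
            for name in base_state}

# Illustrative usage with toy "checkpoints"; in practice these would be the state_dicts
# of the base and instruct variants of the same model.
base = {"layer.weight": torch.zeros(4, 4)}
instruct = {"layer.weight": torch.ones(4, 4)}
blended = partially_adapt(base, instruct, alpha=0.5)   # weaker instruction tuning
```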
https://arxiv.org/abs/2504.11626
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On the token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on the pixel level, it leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early-fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance on community-standard benchmarks for EO such as PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
https://arxiv.org/abs/2504.11171
Supervised learning has recently developed rapidly in scene text segmentation. However, the lack of high-quality datasets and the high cost of pixel-level annotation greatly limit its progress. Motivated by the strong performance of few-shot learning methods on downstream tasks, we investigate the application of few-shot learning to scene text segmentation. We propose TSAL, which leverages CLIP's prior knowledge to learn text attributes for segmentation. To fully utilize the semantic and texture information in the image, a visual-guided branch is proposed to separately extract text and background features. To reduce data dependency and improve text detection accuracy, an adaptive prompt-guided branch employs effective adaptive prompt templates to capture various text attributes. To enable the adaptive prompts to capture distinctive text features and complex background distributions, we propose the Adaptive Feature Alignment (AFA) module. By aligning learnable tokens of different attributes with visual features and prompt prototypes, AFA enables the adaptive prompts to capture both general and distinctive attribute information. TSAL can capture the unique attributes of text and achieve precise segmentation using only a few images. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings and shows great potential in text-related domains.
https://arxiv.org/abs/2504.11164
Spatial imbalances in crop type data pose significant challenges for accurate classification in remote sensing applications. Algorithms aiming at transferring knowledge from data-rich to data-scarce tasks have thus surged in popularity. However, despite their effectiveness in previous evaluations, their performance in challenging real-world applications is unclear and needs to be evaluated. This study benchmarks transfer learning and several meta-learning algorithms, including (First-Order) Model-Agnostic Meta-Learning ((FO)-MAML), Almost No Inner Loop (ANIL), and Task-Informed Meta-Learning (TIML), on the real-world EuroCropsML time series dataset, which combines farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. Our findings indicate that MAML-based meta-learning algorithms achieve slightly higher accuracy compared to simpler transfer learning methods when applied to crop type classification tasks in Estonia after pre-training on data from Latvia. However, this improvement comes at the cost of increased computational demands and training time. Moreover, we find that the transfer of knowledge between geographically disparate regions, such as Estonia and Portugal, poses significant challenges to all investigated algorithms. These insights underscore the trade-offs between accuracy and computational resource requirements in selecting machine learning methods for real-world crop type classification tasks and highlight the difficulties of transferring knowledge between different regions of the Earth. To facilitate future research in this domain, we present the first comprehensive benchmark for evaluating transfer and meta-learning methods for crop type classification under real-world conditions. The corresponding code is publicly available at this https URL.
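For readers less familiar with the benchmarked meta-learning algorithms, the sketch below compresses a first-order MAML episode on a single crop-classification task into a few lines of PyTorch; the model, feature dimensions, and learning rate are placeholders rather than details from the study, and ANIL would differ mainly in adapting only the final head during the inner step.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

def fomaml_episode(model, support, query, inner_lr=1e-2):
    """One first-order MAML episode: adapt a copy of the weights on the support set,
    then compute the outer loss on the query set of the same task."""
    loss_fn = nn.CrossEntropyLoss()
    fast = {name: p.clone() for name, p in model.named_parameters()}

    # Inner loop: a single gradient step on this task's support set.
    x_s, y_s = support
    inner_loss = loss_fn(functional_call(model, fast, (x_s,)), y_s)
    grads = torch.autograd.grad(inner_loss, list(fast.values()))
    fast = {name: w - inner_lr * g for (name, w), g in zip(fast.items(), grads)}

    # Outer loss on the query set; backpropagating it updates the shared initialization.
    x_q, y_q = query
    return loss_fn(functional_call(model, fast, (x_q,)), y_q)

# Toy task: 12-band time-step features, 5 crop classes (sizes are made up).
model = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 5))
support = (torch.randn(10, 12), torch.randint(0, 5, (10,)))
query = (torch.randn(20, 12), torch.randint(0, 5, (20,)))
fomaml_episode(model, support, query).backward()   # gradients land on `model`'s parameters
```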
https://arxiv.org/abs/2504.11022
Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes from a limited set of training samples without forgetting knowledge of previously learned classes. Conventional FSCIL methods typically build a robust feature extractor during the base training session with abundant training samples and subsequently freeze this extractor, only fine-tuning the classifier in subsequent incremental phases. However, current strategies primarily focus on preventing catastrophic forgetting, considering only the relationship between novel and base classes, without paying attention to the specific decision spaces of each class. To address this challenge, we propose a plug-and-play Adaptive Decision Boundary Strategy (ADBS), which is compatible with most FSCIL methods. Specifically, we assign a specific decision boundary to each class and adaptively adjust these boundaries during training to optimally refine the decision spaces for the classes in each session. Furthermore, to amplify the distinctiveness between classes, we employ a novel inter-class constraint loss that optimizes the decision boundaries and prototypes for each class. Extensive experiments on three benchmarks, namely CIFAR100, miniImageNet, and CUB200, demonstrate that incorporating our ADBS method with existing FSCIL techniques significantly improves performance, achieving overall state-of-the-art results.
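The exact form of the boundaries and the inter-class constraint loss is not given in the abstract; the sketch below is one assumed instantiation, using a learnable radius per class prototype and a margin term that pushes prototypes apart. Dimensions, the margin, and the loss weight are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveBoundaryHead(nn.Module):
    """Class prototypes with a learnable decision radius per class."""
    def __init__(self, n_classes: int, dim: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, dim))
        self.log_radius = nn.Parameter(torch.zeros(n_classes))    # adapted in each session

    def forward(self, feats):
        dist = torch.cdist(feats, self.prototypes)                 # (B, C)
        return self.log_radius.exp() - dist                        # inside the boundary => higher score

    def inter_class_loss(self, margin: float = 4.0):
        """Push prototype pairs at least `margin` apart to sharpen class separation."""
        d = torch.cdist(self.prototypes, self.prototypes)
        off_diag = d + torch.eye(len(d), device=d.device) * 1e6    # ignore self-distances
        return F.relu(margin - off_diag).mean()

head = AdaptiveBoundaryHead(n_classes=60, dim=128)
feats = torch.randn(32, 128)             # features from the frozen backbone
labels = torch.randint(0, 60, (32,))
loss = F.cross_entropy(head(feats), labels) + 0.1 * head.inter_class_loss()
loss.backward()
```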
https://arxiv.org/abs/2504.10976
Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registered participants, received submissions from 42 teams, and concluded with 13 teams making valid final submissions. Participants approached the task from diverse perspectives, proposing novel models that achieved new state-of-the-art (SOTA) results under both open-source and closed-source settings. In this report, we present an overview of the 1st NTIRE 2025 CD-FSOD Challenge, highlighting the proposed solutions and summarizing the results submitted by the participants.
https://arxiv.org/abs/2504.10685
Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially enhance the in-context learning performance of language models, but they often fail to improve the model's understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines across all tasks and even surpasses the teacher model, GPT-4o, in some cases. ReDis, based on the LLaMA-3 backbone, achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively, within a similar hypothesis search space. The code, dataset, and model checkpoints will be made available at this https URL.
https://arxiv.org/abs/2504.10647
Robots with wheeled, quadrupedal, or humanoid forms are increasingly integrated into built environments. However, unlike human social learning, they lack a critical pathway for intrinsic cognitive development, namely, learning from human feedback during interaction. To understand ubiquitous human observation, supervision, and shared control in dynamic and uncertain environments, this study presents a brain-computer interface (BCI) framework that enables classification of electroencephalogram (EEG) signals to detect cognitively demanding and safety-critical events. As a timely and motivating co-robotic engineering application, we simulate a human-in-the-loop scenario to flag risky events in semi-autonomous robotic driving, representative of long-tail cases that pose persistent bottlenecks to the safety performance of smart mobility systems and robotic vehicles. Drawing on recent advances in few-shot learning, we propose a dual-attention Siamese convolutional network paired with a Dynamic Time Warping Barycenter Averaging approach to generate robust EEG-encoded signal representations. Inverse source localization reveals activation in Brodmann areas 4 and 9, indicating perception-action coupling during task-relevant mental imagery. The model achieves 80% classification accuracy under data-scarce conditions and exhibits a nearly 100% increase in the utility of salient features compared to state-of-the-art methods, as measured through integrated gradient attribution. Beyond performance, this study contributes to our understanding of the cognitive architecture required for BCI agents, particularly the role of attention and memory mechanisms, in categorizing diverse mental states and supporting both inter- and intra-subject adaptation. Overall, this research advances the development of cognitive robotics and socially guided learning for service robots in complex built environments.
https://arxiv.org/abs/2504.10296
Multimodal Large Language Models (MLLMs) are set to transform how machines process and generate human-like responses by integrating diverse modalities such as text, images, and code. Yet, effectively harnessing their capabilities hinges on optimal prompt engineering. We present a comprehensive experimental evaluation of seven prompt engineering methods applied to 13 open-source MLLMs over 24 tasks spanning Reasoning and Compositionality, Multimodal Understanding and Alignment, Complex Code Generation and Execution, and Knowledge Retrieval and Integration. Our approach stratifies models by parameter count into Small (<4B), Medium (4B-10B), and Large (>10B) categories and compares prompting techniques including Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought. While Large MLLMs excel in structured tasks such as code generation, achieving accuracies up to 96.88% under Few-Shot prompting, all models struggle with complex reasoning and abstract understanding, often yielding accuracies below 60% and high hallucination rates. Structured reasoning prompts frequently increased hallucination up to 75% in small models and led to longer response times (over 20 seconds in Large MLLMs), while simpler prompting methods provided more concise and efficient outputs. No single prompting method uniformly optimises all task types. Instead, adaptive strategies combining example-based guidance with selective structured reasoning are essential to enhance robustness, efficiency, and factual accuracy. Our findings offer practical recommendations for prompt engineering and support more reliable deployment of MLLMs across applications including AI-assisted coding, knowledge retrieval, and multimodal content understanding.
https://arxiv.org/abs/2504.10179
Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. Methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, fails to achieve reliable performance because image and text representations in VLMs are often inconsistent. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps the 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown-object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU.
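A speculative toy version of the two training signals named in the abstract is sketched below: a similarity-based grounding loss against class prompts using 3D pseudo-labels, plus a modality-adapter loss pulling 3D embeddings toward VLM image embeddings. Shapes, the temperature, and the per-voxel image targets are assumptions made only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative shapes: V voxels with D-dim 3D embeddings and C class prompts.
V, C, D = 1024, 17, 512
voxel_emb = torch.randn(V, D, requires_grad=True)        # from the 3D occupancy network
text_emb = F.normalize(torch.randn(C, D), dim=-1)        # encoded class prompts
pseudo_labels = torch.randint(0, C, (V,))                 # VLM-derived 3D pseudo-labels
image_target = torch.randn(V, D)                          # per-voxel VLM image embeddings

adapter = nn.Linear(D, D)                                 # modality adapter

# Similarity-based grounding: classify each voxel by cosine similarity to class prompts.
logits = F.normalize(voxel_emb, dim=-1) @ text_emb.t() / 0.07
grounding_loss = F.cross_entropy(logits, pseudo_labels)

# Modality adapter: pull adapted 3D embeddings toward the VLM image-embedding space.
align_loss = 1 - F.cosine_similarity(adapter(voxel_emb), image_target, dim=-1).mean()

(grounding_loss + align_loss).backward()
```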
https://arxiv.org/abs/2504.10117
Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations independently for each video by designing various inter-frame temporal modeling strategies. However, they neglect explicit relation modeling between videos and tasks, thus failing to capture shared temporal patterns across videos or to reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. In addition to conducting inter-frame temporal interactions, we devise two components to explore inter-video and inter-task relationships, respectively: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and learning intra- and inter-class temporal correlations among support features; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from a bank that stores diverse temporal patterns from historical tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.
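As an illustration of the inter-task knowledge transfer idea, the sketch below stores temporal-pattern vectors from past tasks and enriches a new task representation with the average of its most similar stored patterns; the bank contents, the cosine-similarity retrieval rule, and the additive aggregation are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

class TemporalKnowledgeBank:
    """Stores temporal-pattern vectors from past tasks and aggregates them for a new task."""
    def __init__(self, max_items: int = 512):
        self.items = []
        self.max_items = max_items

    def store(self, pattern: torch.Tensor):
        self.items.append(pattern.detach())
        self.items = self.items[-self.max_items:]          # keep only the most recent patterns

    def retrieve(self, task_repr: torch.Tensor, k: int = 5) -> torch.Tensor:
        bank = torch.stack(self.items)                                     # (N, D)
        sims = F.cosine_similarity(task_repr.unsqueeze(0), bank, dim=-1)   # (N,)
        idx = sims.topk(min(k, len(bank))).indices
        retrieved = bank[idx].mean(dim=0)
        return task_repr + retrieved                        # inject prior temporal knowledge

bank = TemporalKnowledgeBank()
for _ in range(20):                       # patterns collected from historical episodes
    bank.store(torch.randn(256))
enriched = bank.retrieve(torch.randn(256))
```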
https://arxiv.org/abs/2504.10079