This study investigates the potential of a multimodal large language model (LLM), specifically ChatGPT-4o, to perform human-like interpretations of traffic scenes using static dashcam images. Herein, we focus on three judgment tasks relevant to elderly driver assessments: evaluating traffic density, assessing intersection visibility, and recognizing stop signs. These tasks require contextual reasoning rather than simple object detection. Using zero-shot, few-shot, and multi-shot prompting strategies, we evaluated the performance of the model with human annotations serving as the reference standard. Evaluation metrics included precision, recall, and F1-score. Results indicate that prompt design considerably affects performance, with recall for intersection visibility increasing from 21.7% (zero-shot) to 57.0% (multi-shot). For traffic density, agreement increased from 53.5% to 67.6%. In stop-sign detection, the model demonstrated high precision (up to 86.3%) but lower recall (approximately 76.7%), indicating a conservative response tendency. Output stability analysis revealed that both humans and the model had difficulty interpreting structurally ambiguous scenes. However, the model's explanatory texts corresponded with its predictions, enhancing interpretability. These findings suggest that, with well-designed prompts, LLMs hold promise as supportive tools for scene-level driving risk assessments. Future studies should explore scalability using larger datasets, diverse annotators, and next-generation model architectures for elderly driver assessments.
这项研究探讨了多模态大型语言模型(LLM),特别是ChatGPT-4o,在使用静态行车记录仪图像进行类似人类的交通场景解读方面的潜力。在此,我们重点关注三种与老年人驾驶员评估相关的判断任务:评价交通密度、评估交叉路口视野和识别停止标志。这些任务需要情境推理而不是简单的物体检测。通过零样本(zero-shot)、少量样本(few-shot)和多样本(multi-shot)提示策略,我们在人类注释作为参考标准的情况下评估了模型的性能。评估指标包括精确度、召回率和F1分数。 结果显示,提示设计显著影响了模型的表现,在交叉路口视野方面,召回率从零样本条件下的21.7%提高到了多样本条件下的57.0%;在交通密度方面,一致性从53.5%增加到67.6%。对于停止标志检测,该模型展现了高精确度(高达86.3%),但较低的召回率(大约为76.7%),表明其倾向于保守反应。 输出稳定性分析显示,人类和模型在解读结构模糊场景时都面临困难,不过,模型生成的解释性文本与其预测相吻合,提高了可解释性。这些发现表明,在设计良好的提示条件下,LLM作为用于场景级驾驶风险评估的支持工具具有潜力。未来的研究应该探索使用更大规模的数据集、多样化的标注人员以及下一代模型架构来进行老年人驾驶员评估的可能性。
https://arxiv.org/abs/2507.08367
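The precision, recall, and F1-score named in the traffic-scene abstract above can be illustrated with a minimal sketch. The labels below are invented for illustration and are not data from the paper; the function name `precision_recall_f1` is likewise hypothetical, and human annotations are assumed to be binary (1 = stop sign present).

```python
def precision_recall_f1(reference, predicted, positive=1):
    """Compute precision, recall, and F1 for one positive class.

    `reference` holds the human annotations (the reference standard),
    `predicted` holds the model outputs; both are equal-length lists.
    """
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)

    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical toy labels (NOT from the paper): 1 = stop sign present.
human = [1, 1, 1, 0, 0, 1, 0, 1]
model = [1, 0, 1, 0, 0, 1, 1, 0]

p, r, f = precision_recall_f1(human, model)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

A model that answers "present" only when it is very sure tends to show the pattern the abstract reports: few false positives (high precision) at the cost of missed positives (lower recall).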
Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGVC) using the FungiTastic Few-Shot dataset. Our team (DS@GT) experimented with multiple vision transformer models, data augmentation, weighted sampling, and incorporating textual information. We also explored generative AI models for zero-shot classification using structured prompting but found them to significantly underperform relative to vision-based models. Our final model outperformed both competition baselines and highlighted the effectiveness of domain-specific pretraining and balanced sampling strategies. Our approach ranked 35/74 on the private test set in post-completion evaluation, which suggests that additional work can be done on metadata selection and domain-adapted multi-modal learning. Our code is available at this https URL.
真菌种类的准确识别在计算机视觉领域面临着独特的挑战,主要是由于细粒度的种间差异和高度的种内变异。本文介绍了我们为FungiCLEF 2025竞赛设计的方法,该竞赛聚焦于使用FungiTastic Few-Shot数据集进行少量样本下的细粒度视觉分类(FGVC)。我们的团队(DS@GT)尝试了多种视觉Transformer模型、数据增强技术、加权采样以及文本信息的融合。我们还探索了通过结构化提示将生成式AI模型用于零样本分类,但发现其表现明显不及基于视觉的模型。 最终,我们的模型超越了竞赛的两个基准,并凸显了领域特定预训练和平衡采样策略的有效性。在比赛结束后对私有测试集的评估中,我们的方法排名为35/74,这表明在元数据选择和领域自适应多模态学习方面还有进一步研究的空间。 我们的代码可在此 https URL 获取。
https://arxiv.org/abs/2507.08248
We propose a new approach for generating SPARQL queries on RDF knowledge graphs from natural language questions or keyword queries, using a large language model. Our approach does not require fine-tuning. Instead, it uses the language model to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals. We evaluate our approach on a variety of benchmarks (for knowledge graphs of different kinds and sizes) and language models (of different scales and types, commercial as well as open-source) and compare it with existing approaches. On Wikidata we reach state-of-the-art results on multiple benchmarks, despite the zero-shot setting. On Freebase we come close to the best few-shot methods. On other, less commonly evaluated knowledge graphs and benchmarks our approach also performs well overall. We conduct several additional studies, like comparing different ways of searching the graphs, incorporating a feedback mechanism, or making use of few-shot examples.
我们提出了一种新的方法,用于通过大型语言模型从自然语言问题或关键词查询生成RDF知识图谱上的SPARQL查询。我们的方法不需要微调,而是利用语言模型通过战略性地执行SPARQL查询并在知识图中搜索相关IRIs和文字来探索知识图。我们在各种基准测试(涵盖不同类型和大小的知识图)和语言模型(包括不同规模、类型以及商业和开源模型)上评估了该方法,并与现有方法进行了比较。在Wikidata上,尽管是在零样本设置下,我们依然在多个基准测试中达到了最先进的结果;在Freebase上,我们的方法接近于最优的少样本方法;在其他较少被评价的知识图谱和基准测试中,我们的方法整体表现良好。此外,我们还进行了一些额外的研究,比如比较不同的图形搜索方式、引入反馈机制或利用少量示例等。
https://arxiv.org/abs/2507.08107
Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
关键点检测是现代机器感知的核心组成部分,但在少量样本学习中面临挑战,特别是在缺乏与查询数据同分布的源数据时。为解决这一问题,我们利用了素描这种流行的人类表达形式,提供了一种无需来源数据的替代方案。然而,在掌握跨模态嵌入和处理用户特定的素描风格方面仍然存在挑战。我们的框架通过结合原型设置、基于网格的位置检测器以及原型领域自适应技术来克服这些障碍。我们还通过广泛的实验展示了在新颖关键点和类别上实现少量样本收敛的成功。
https://arxiv.org/abs/2507.07994
This paper presents a novel few-shot cross-domain anomaly detection framework, Nexus Vision Transformer for Anomaly Detection (NexViTAD), based on vision foundation models, which effectively addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms and a multi-task learning (MTL) module. The main innovations include: (1) a hierarchical adapter module that adaptively fuses complementary features from Hiera and DINO-v2 pre-trained models, constructing more robust feature representations; (2) a shared subspace projection strategy that enables effective cross-domain knowledge transfer through bottleneck dimension constraints and skip connection mechanisms; (3) an MTL decoder architecture that supports simultaneous processing of multiple source domains, significantly enhancing model generalization capabilities; (4) an anomaly score inference method based on Sinkhorn-K-means clustering, combined with Gaussian filtering and adaptive threshold processing for precise pixel-level detection. Evaluated on the MVTec AD dataset, NexViTAD delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains, surpassing other recent models and marking a transformative advance in cross-domain defect detection.
本文提出了一种基于视觉基础模型的新型少量样本跨域异常检测框架Nexus Vision Transformer for Anomaly Detection (NexViTAD),通过创新性的共享子空间投影机制和多任务学习(MTL)模块,有效地解决了工业异常检测中的领域偏移(domain-shift)挑战。主要创新包括: 1. 一个层次适配器模块,该模块能够自适应地融合来自Hiera和DINO-v2预训练模型的互补特征,构建更稳健的特征表示。 2. 共享子空间投影策略,通过瓶颈维度约束和跳跃连接机制使跨域知识转移有效进行。 3. 多任务学习解码器架构支持同时处理多个源领域,显著增强模型泛化能力。 4. 基于Sinkhorn-K-means聚类的异常评分推断方法,并结合高斯滤波和自适应阈值处理实现精确像素级别的检测。 在MVTec AD数据集上进行评估时,NexViTAD在目标域中分别达到了AUC 97.5%、AP 70.4% 和PRO 95.2% 的顶尖性能,超越了其他最近的模型,在跨领域缺陷检测方面实现了变革性的进展。
https://arxiv.org/abs/2507.07579
This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.
本文提出了一种使用大型语言模型(LLM)和少量提示进行多语种主观性检测的竞争方法。我们在CheckThat! 2025评估活动中参加了任务1:主观性检测。研究表明,当与精心设计的提示相结合时,LLM可以匹配或超越经过微调的小型语言模型(SLM),尤其是在嘈杂或低质量数据环境中。 尽管我们尝试了先进的提示工程技术,如让LLM进行辩论和多种示例选择策略,但我们发现除了精心制作的标准少量提示外,并没有获得显著的额外好处。我们的系统在CheckThat! 2025主观性检测任务中多个语种的比赛中取得了顶尖排名,包括阿拉伯语和波兰语的第一名以及意大利语、英语、德语和多语言赛道中的前四名。 值得注意的是,我们方法在阿拉伯语文本数据集上表现尤为稳健,这可能是因为其对标注不一致具有较强的鲁棒性。这些发现强调了基于LLM的少量学习对于跨语言情感任务的有效性和适应性,并为传统微调提供了强大的替代方案,特别是在标签数据稀缺或不一致的情况下。
https://arxiv.org/abs/2507.07539
The labor-intensive nature of medical data annotation presents a significant challenge for respiratory disease diagnosis, resulting in a scarcity of high-quality labeled datasets in resource-constrained settings. Moreover, patient privacy concerns complicate the direct sharing of local medical data across institutions, and existing centralized data-driven approaches, which rely on large amounts of available data, often compromise data privacy. This study proposes a federated few-shot learning framework with privacy-preserving mechanisms to address the issues of limited labeled data and privacy protection in diagnosing respiratory diseases. In particular, a meta-stochastic gradient descent algorithm is proposed to mitigate the overfitting problem that arises from insufficient data when employing traditional gradient descent methods for neural network training. Furthermore, to ensure data privacy against gradient leakage, differential privacy noise from a standard Gaussian distribution is integrated into the gradients during the training of private models with local data, thereby preventing the reconstruction of medical images. Given the impracticality of centralizing respiratory disease data dispersed across various medical institutions, a weighted average algorithm is employed to aggregate local diagnostic models from different clients, enhancing the adaptability of the model across diverse scenarios. Experimental results show that the proposed method yields compelling results with the implementation of differential privacy, while effectively diagnosing respiratory diseases using data from different structures, categories, and distributions.
医疗数据标注的劳动密集型特性对呼吸道疾病诊断构成了重大挑战,导致资源受限环境中高质量标记数据集的稀缺。此外,患者隐私问题使得难以直接跨机构共享本地医学数据,并且现有的集中式数据驱动方法依赖于大量可用的数据,在保护数据隐私方面往往存在妥协。本研究提出了一种带有隐私保护机制的联邦小样本学习框架,以解决在诊断呼吸道疾病时面临的有限标注数据和隐私保护的问题。具体来说,提出了一个元随机梯度下降算法来减轻由于缺乏数据而在使用传统梯度下降方法训练神经网络时出现的过拟合问题。此外,为了确保在本地数据上训练私有模型时对抗梯度泄露的数据隐私,在梯度中加入了来自标准高斯分布的差分隐私噪声,从而防止医学图像的重建。鉴于集中化分散于不同医疗机构中的呼吸道疾病数据不切实际,采用了加权平均算法来聚合来自不同客户端的局部诊断模型,以增强模型在各种场景下的适应性。实验结果表明,所提出的方法在实现差分隐私的情况下取得了令人信服的结果,并且能够有效地使用不同结构、类别和分布的数据进行呼吸道疾病的诊断。
https://arxiv.org/abs/2507.08050
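Two mechanisms named in the federated few-shot abstract above, Gaussian noise added to local gradients and weighted averaging of client models, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names, the use of local dataset size as the aggregation weight, and the flat-list parameter representation are all assumptions, and a real differential-privacy guarantee would also need gradient clipping and a calibrated noise scale.

```python
import random


def add_gaussian_noise(gradient, sigma=1.0, rng=None):
    """Perturb each gradient entry with zero-mean Gaussian noise.

    sigma=1.0 corresponds to a standard Gaussian; clipping and privacy
    accounting are deliberately omitted from this sketch.
    """
    rng = rng or random.Random()
    return [g + rng.gauss(0.0, sigma) for g in gradient]


def weighted_average(client_params, client_sizes):
    """Aggregate client parameter vectors, weighting by local dataset size
    (an assumed weighting; the abstract only says "weighted average")."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical two-client round: each client perturbs its gradient locally,
# then the server averages the clients' parameters.
rng = random.Random(0)
noisy_gradient = add_gaussian_noise([0.2, -0.1], sigma=1.0, rng=rng)
global_params = weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(noisy_gradient, global_params)
```

Weighting by dataset size lets clients with more local data pull the global model further, which is one common way to aggregate when institutions hold very different amounts of data.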
Phishing attacks are becoming increasingly sophisticated, underscoring the need for detection systems that strike a balance between high accuracy and computational efficiency. This paper presents a comparative evaluation of traditional Machine Learning (ML), Deep Learning (DL), and quantized small-parameter Large Language Models (LLMs) for phishing detection. Through experiments on a curated dataset, we show that while LLMs currently underperform compared to ML and DL methods in terms of raw accuracy, they exhibit strong potential for identifying subtle, context-based phishing cues. We also investigate the impact of zero-shot and few-shot prompting strategies, revealing that LLM-rephrased emails can significantly degrade the performance of both ML and LLM-based detectors. Our benchmarking highlights that models like DeepSeek R1 Distill Qwen 14B (Q8_0) achieve competitive accuracy, above 80%, using only 17GB of VRAM, supporting their viability for cost-efficient deployment. We further assess the models' adversarial robustness and cost-performance tradeoffs, and demonstrate how lightweight LLMs can provide concise, interpretable explanations to support real-time decision-making. These findings position optimized LLMs as promising components in phishing defence systems and offer a path forward for integrating explainable, efficient AI into modern cybersecurity frameworks.
网络钓鱼攻击变得越来越复杂,这强调了需要开发一种既能保证高精度又具有计算效率的检测系统。本文通过在精心策划的数据集上进行实验,对传统机器学习(ML)、深度学习(DL)和量化小型参数大型语言模型(LLM)在网络钓鱼检测方面的性能进行了比较评估。 研究发现,尽管目前大型语言模型在原始准确性方面仍然不及机器学习和深度学习方法,但它们在识别细微、基于上下文的网络钓鱼线索方面表现出巨大潜力。我们还调查了零样本和少量样本提示策略对结果的影响,并揭示出由LLM重写的电子邮件会显著降低ML和基于LLM检测器的性能。 我们的基准测试显示,诸如DeepSeek R1 Distill Qwen 14B(Q8_0)这样的模型,在仅使用17GB显存的情况下即可达到超过80%的准确性,这表明它们在成本效益部署方面是可行的选择。此外,我们还评估了这些模型对对抗性攻击的鲁棒性和成本性能权衡,并展示了轻量级LLM如何提供简洁且可解释的说明来支持实时决策。 这些发现将优化后的大型语言模型定位为网络钓鱼防御系统中的有前景组件,并为进一步在现代网络安全框架中集成可解释、高效的AI技术铺平了道路。
https://arxiv.org/abs/2507.07406
Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: this https URL.
通用深度完成技术能够获取未见过环境中密集的度量深度图,为各种下游任务提供强大的感知能力。然而,训练这类模型通常需要大规模带有度量深度标签的数据集,而这些数据集的采集往往耗时费力。本文介绍了一种名为PacGDC的技术,这是一种标注效率高的方法,能够以极少的标注工作量增强数据多样性,从而提升通用深度完成的能力。 PacGDC基于对从2D到3D投影过程中物体形状和位置内在模糊性和一致性的新颖见解,能够为相同的视觉场景合成大量伪几何体。通过操纵相应深度图的场景尺度,这一过程极大地拓宽了可用的几何结构范围。为了利用这一特性,我们提出了一种新的数据合成流水线,该流水线使用多个深度基础模型作为尺度调整工具。 这些模型能够稳健地提供具有不同场景尺度的伪深度标签,既影响局部物体也影响全局布局,同时保证投影的一致性以支持泛化能力。为进一步丰富几何结构多样性,我们将插值策略、重定位策略以及无标注图像纳入其中,从而将数据覆盖面扩展至单独使用基础模型所能达到的范围之外。 广泛的实验表明,PacGDC在多个基准测试中表现出卓越的泛化能力,在零样本(zero-shot)和少量样本(few-shot)设置下,对多样的场景语义/尺度和深度稀疏度/模式均表现出色。代码:此 https URL。
https://arxiv.org/abs/2507.07374
Modern deep learning implementations for medical imaging usually rely on large labeled datasets. These datasets are often difficult to obtain due to privacy concerns, high costs, and even scarcity of cases. In this paper, a label-efficient strategy is proposed for chest X-ray diagnosis that seeks to reflect real-world hospital scenarios. The experiments use the NIH Chest X-ray14 dataset and a pre-trained CLIP ViT-B/32 model. The model is adapted via partial fine-tuning of its visual encoder and then evaluated using zero-shot and few-shot learning with 1-16 labeled examples per disease class. The tests demonstrate that CLIP's pre-trained vision-language features can be effectively adapted to few-shot medical imaging tasks, achieving over 20\% improvement in mean AUC score as compared to the zero-shot baseline. The key aspect of this work is its attempt to simulate internal hospital workflows, where image archives exist but annotations are sparse. This work evaluates a practical and scalable solution for both common and rare disease diagnosis. Additionally, this research is intended for academic and experimental purposes only and has not been peer reviewed yet. All code is found at this https URL.
现代深度学习在医学影像领域的实现通常依赖于大规模的标注数据集。由于隐私问题、高昂的成本以及病例稀缺等原因,这些数据集往往难以获取。本文提出了一种针对胸部X光诊断的标签高效策略,旨在反映真实世界的医院场景。实验中使用了NIH Chest X-ray14数据集和一个预训练的CLIP ViT-B/32模型。该模型先通过部分微调其视觉编码器进行适配,然后利用零样本学习和少量样本学习(每种疾病类别1到16个标注示例)进行评估。实验结果表明,CLIP预先训练好的视觉-语言特征可以有效地适应医学影像中的少量样本任务,在平均AUC得分上相较于零样本基准提高了超过20%。 这项工作的关键在于尝试模拟医院内部的工作流程,在这种流程中存在图像档案但标注信息稀少。该研究评估了一种针对常见病和罕见病诊断的实际且可扩展的解决方案。需要注意的是,本研究仅用于学术和实验目的,尚未经过同行评审。所有代码可在此 https URL 找到。
https://arxiv.org/abs/2507.07254
Domain-Incremental Learning (DIL) focuses on continual learning in non-stationary environments, requiring models to adjust to evolving domains while preserving historical knowledge. DIL faces two critical challenges in the context of imbalanced data: intra-domain class imbalance and cross-domain class distribution shifts. These challenges significantly hinder model performance, as intra-domain imbalance leads to underfitting of few-shot classes, while cross-domain shifts require maintaining well-learned many-shot classes and transferring knowledge to improve few-shot class performance in old domains. To overcome these challenges, we introduce the Dual-Balance Collaborative Experts (DCE) framework. DCE employs a frequency-aware expert group, where each expert is guided by specialized loss functions to learn features for specific frequency groups, effectively addressing intra-domain class imbalance. Subsequently, a dynamic expert selector is learned by synthesizing pseudo-features through balanced Gaussian sampling from historical class statistics. This mechanism navigates the trade-off between preserving many-shot knowledge of previous domains and leveraging new data to improve few-shot class performance in earlier tasks. Extensive experimental results on four benchmark datasets demonstrate DCE's state-of-the-art performance.
领域递增学习(DIL)专注于在非平稳环境中进行连续学习,要求模型能够适应不断变化的领域环境,并同时保持历史知识。在处理不平衡数据的情况下,DIL面临两个关键挑战:域内类别不平衡和跨域类别分布偏移。这些问题严重制约了模型的表现能力,因为域内的类别不平衡会导致少数样本类别的欠拟合问题;而跨域迁移则需要维持过去多示例类别的良好学习成果,并利用新信息来改进早期任务中少数示例类别的表现。 为了克服这些挑战,我们引入了双重平衡协作专家(DCE)框架。DCE采用了一种频率感知的专家小组,在这个小组中,每个专家通过专门设计的损失函数被引导去为特定频率组学习特征,从而有效解决了域内的类别不平衡问题。随后,一个动态专家选择器通过从历史类别的统计信息中平衡高斯采样合成伪特征来实现学习。这种机制能够在保留先前领域中的多示例知识和利用新数据提高早期任务少数样本类别性能之间找到平衡点。 在四个基准数据集上的广泛实验结果表明,DCE框架展示了最先进的性能。
https://arxiv.org/abs/2507.07100
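The "balanced Gaussian sampling from historical class statistics" step in the DCE abstract above can be illustrated with a minimal sketch. The details here are assumptions rather than the paper's method: each class is assumed to be summarized by a per-dimension mean and standard deviation (a diagonal Gaussian), and the function name `sample_balanced_pseudo_features` and the toy statistics are hypothetical.

```python
import random
from collections import Counter


def sample_balanced_pseudo_features(class_stats, per_class, seed=0):
    """Draw the same number of pseudo-features for every class.

    `class_stats` maps a class label to (mean_vector, std_vector), i.e. the
    stored statistics of features seen in earlier domains. Sampling an equal
    count per class gives few-shot and many-shot classes equal weight.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for label, (mean, std) in class_stats.items():
        for _ in range(per_class):
            features.append([rng.gauss(m, s) for m, s in zip(mean, std)])
            labels.append(label)
    return features, labels


# Hypothetical statistics for one many-shot and one few-shot class.
stats = {
    "many_shot_class": ([0.0, 1.0], [0.5, 0.5]),
    "few_shot_class": ([2.0, -1.0], [0.3, 0.8]),
}
feats, labs = sample_balanced_pseudo_features(stats, per_class=4)
print(Counter(labs))
```

Because the pseudo-features are drawn from stored statistics rather than from raw samples, a selector trained on them sees a class-balanced view of old domains without retaining the original data.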
Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by a dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
大型多模态模型(LMM)在餐食图像营养分析中的应用越来越广泛。然而,现有研究主要集中在评估专有模型,如GPT-4,这导致了对大量其他LLM的探索不足。此外,整合上下文元数据及其与各种推理修饰符交互的影响仍是一个未被充分探索的领域。这项工作探讨了如何通过解析从GPS坐标(转换为位置/场所类型)、时间戳(转化为餐食/日期类型)以及餐食中食物项目获取的上下文元数据来增强LMM在估算关键营养值方面的性能,包括卡路里、宏量营养素(蛋白质、碳水化合物、脂肪)和份量大小。我们还引入了一个新的食品图像数据集——ACETADA,并计划公开发布该数据集。这个开放的数据集提供了经过注册营养师验证的营养信息,并作为我们分析的基础。 我们在八种LMM(四种开源模型和四种闭源模型)上的评估首先证明了整合上下文元数据比单纯使用图像进行直接提示所带来的益处。接着,我们展示了这种上下文信息的融入如何增强推理修饰符的效果,如思维链、多模态思维链、尺度提示、少量示例学习及专家角色等。 实证结果表明,智能地整合元数据,并通过简单的提示策略应用时,可以显著降低预测营养值的平均绝对误差(MAE)和平均绝对百分比误差(MAPE)。这项工作强调了上下文感知LMM在改善营养分析方面的潜力。
https://arxiv.org/abs/2507.07048
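The two error measures named in the nutrition-analysis abstract above, MAE and MAPE, follow their standard definitions and can be sketched as below. The calorie values are invented for illustration and are not from the ACETADA dataset; the function names are hypothetical.

```python
def mean_absolute_error(reference, predicted):
    """MAE: average absolute difference, in the unit of the quantity (e.g. kcal)."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def mean_absolute_percentage_error(reference, predicted):
    """MAPE: average absolute difference relative to the reference, in percent.

    Assumes no reference value is zero; a zero reference would need
    special handling that this sketch omits.
    """
    ratios = [abs(r - p) / abs(r) for r, p in zip(reference, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical calorie values (kcal) for three meals, NOT from the paper.
dietitian_kcal = [500.0, 250.0, 800.0]
model_kcal = [450.0, 300.0, 800.0]

print(f"MAE  = {mean_absolute_error(dietitian_kcal, model_kcal):.2f} kcal")
print(f"MAPE = {mean_absolute_percentage_error(dietitian_kcal, model_kcal):.2f} %")
```

The two measures complement each other: the same 50 kcal error counts equally in MAE for both the 500 kcal and the 250 kcal meal, while MAPE penalizes it twice as heavily on the smaller meal.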
Recent advancements in large language models (LLMs) underscore the need for stronger reasoning capabilities to solve complex problems effectively. While Chain-of-Thought (CoT) reasoning has been a step forward, it remains insufficient for many domains. A promising alternative is explicit high-level plan generation, but existing approaches largely assume that LLMs can produce effective plans through few-shot prompting alone, without additional training. In this work, we challenge this assumption and introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a multi-domain dataset of high-level plans for mathematical reasoning and code generation. The plans in CRISP are automatically generated and rigorously validated--both intrinsically, using an LLM as a judge, and extrinsically, by evaluating their impact on downstream task performance. We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting, while significantly outperforming Chain-of-Thought reasoning. Furthermore, our out-of-domain evaluation reveals that fine-tuning on one domain improves plan generation in the other, highlighting the generalizability of learned planning capabilities.
近期在大型语言模型(LLM)方面的进展强调了增强推理能力以有效解决复杂问题的必要性。尽管“链式思维”(Chain-of-Thought,CoT) 推理是一个进步,但它对于许多领域仍然不足。一种有前途的替代方案是显式的高层次计划生成,但现有的方法大多假设LLM可以通过少量样本提示单独产生有效的计划,而无需额外训练。在本研究中,我们挑战这一假设,并引入了CRISP(基于可解释步骤的复杂推理),这是一个针对数学推理和代码生成的多领域高层次计划数据集。CRISP中的计划是自动产生的并经过严格的验证——既内在地通过LLM作为评判进行验证,也外在地通过对下游任务性能的影响来评估。我们证明,在CRISP上对一个小模型进行微调使其能够产生比使用少量样本提示的大得多的模型更高的质量计划,并且显著优于链式思维推理。此外,我们的跨领域评估显示,一个领域的微调改善了另一个领域的计划生成能力,突显出学习到的规划能力的通用性。
https://arxiv.org/abs/2507.08037
Medical anomaly detection (AD) is challenging due to diverse imaging modalities, anatomical variations, and limited labeled data. We propose a novel approach combining visual adapters and prompt learning with Partial Optimal Transport (POT) and contrastive learning (CL) to improve CLIP's adaptability to medical images, particularly for AD. Unlike standard prompt learning, which often yields a single representation, our method employs multiple prompts aligned with local features via POT to capture subtle abnormalities. CL further enforces intra-class cohesion and inter-class separation. Our method achieves state-of-the-art results in few-shot, zero-shot, and cross-dataset scenarios without synthetic data or memory banks. The code is available at this https URL.
医学异常检测(AD)由于成像模态多样、解剖结构变异以及标注数据有限而颇具挑战。我们提出了一种结合视觉适配器和提示学习的新方法,并引入了部分最优传输(POT)及对比学习(CL),以增强CLIP在医学图像上的适应性,尤其是在异常检测方面。与通常只产生单一表示的标准提示学习不同,我们的方法使用多个通过POT与局部特征对齐的提示来捕捉细微的异常情况。此外,对比学习进一步增强了类内一致性及类间区分度。我们的方法在少量样本、零样本以及跨数据集场景下均达到了最先进的性能,并且没有使用合成数据或记忆库(memory bank)。代码可在此 https URL 获取。
https://arxiv.org/abs/2507.06733
Large language models (LLMs), including zero-shot and few-shot paradigms, have shown promising capabilities in clinical text generation. However, real-world applications face two key challenges: (1) patient data is highly unstructured, heterogeneous, and scattered across multiple note types, and (2) clinical notes are often long and semantically dense, making naive prompting infeasible due to context length constraints and the risk of omitting clinically relevant information. We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a domain-specific framework for structured and clinically grounded text generation using LLMs. It incorporates a novel hierarchical chunking strategy that respects clinical document structure and introduces a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types using evidence-based queries, while the local stage extracts high-value content within those notes, creating relevance at both document and section levels. We apply the system to generate structured progress notes for individual hospital visits using 15 clinical note types from the MIMIC-III dataset. Experiments show that it preserves temporal and semantic alignment across visits, achieving an average alignment score of 87.7%, surpassing the 80.7% baseline from real clinician-authored notes. The generated outputs also demonstrate high consistency across LLMs, reinforcing deterministic behavior essential for reproducibility, reliability, and clinical trust.
大型语言模型(LLMs),包括零样本和少样本范式,在临床文本生成方面展现出了令人鼓舞的能力。然而,实际应用面临两个关键挑战:(1) 患者数据高度不规范、异质化,并分散在多种记录类型中;(2) 临床记录通常很长且语义密集,这使得由于上下文长度限制和遗漏重要临床信息的风险,简单的提示方法变得不可行。我们引入了CLI-RAG(基于临床信息的检索增强生成),这是一种使用LLMs进行结构化和临床基础文本生成的专业领域框架。它结合了一种新颖的分层切块策略,该策略尊重临床文档结构,并引入了一个特定任务的双阶段检索机制。全局阶段通过基于证据的查询识别相关记录类型,而局部阶段则在这些笔记中提取高价值内容,在文档和部分级别上创建相关性。我们将系统应用于使用MIMIC-III数据集中的15种临床记录类型生成结构化的进展记录,以反映个人医院访问的情况。实验表明,该系统能够在不同就诊之间保持时间和语义的一致性,达到了87.7%的平均一致性得分,超过了真实临床医生撰写的笔记中80.7%的基础线水平。生成的结果还展示了在LLMs之间的高度一致性,这加强了确定性的行为,这对于可重复性、可靠性和临床信任至关重要。
https://arxiv.org/abs/2507.06715
The emergence of Small Language Models (SLMs) as privacy-preserving alternatives for sensitive applications raises a fundamental question about their inherent understanding capabilities compared to Large Language Models (LLMs). This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across diverse classification tasks. Employing zero-shot and few-shot learning paradigms, we benchmark their performance against established LLM baselines to elucidate their relative strengths and limitations in this critical domain. We assess five state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) against three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) on six mental health understanding tasks. Our findings reveal that SLMs achieve mean performance within 2\% of LLMs on binary classification tasks (F1 scores of 0.64 vs 0.66 in zero-shot settings), demonstrating notable competence despite orders of magnitude fewer parameters. Both model categories experience similar degradation on multi-class severity tasks (a drop of over 30\%), suggesting that nuanced clinical understanding challenges transcend model scale. Few-shot prompting provides substantial improvements for SLMs (up to 14.6\%), while LLM gains are more variable. Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data. In particular, their ability to quickly adapt and specialize with minimal data through few-shot learning positions them as promising candidates for scalable mental health screening tools.
小型语言模型(SLMs)作为敏感应用中隐私保护的替代方案出现,引发了关于它们与大型语言模型(LLMs)相比内在理解能力的根本问题。本文通过在多种分类任务上的系统评估,研究了当前SLMs在心理健康理解方面的能力。我们采用零样本和少量样本学习范式,在基准LLM上进行性能测试,以阐明其在这一关键领域的相对优势和局限性。我们评估了五种最先进的SLMs(Phi-3、Phi-3.5、Qwen2.5、Llama-3.2、Gemma2)与三种LLMs(GPT-4、FLAN-T5-XXL、Alpaca-7B)在六个心理健康理解任务上的表现。我们的研究发现,SLMs在二元分类任务中的平均性能仅比LLMs低2%(零样本设置下的F1分数分别为0.64和0.66),尽管它们的参数数量少得多,仍表现出显著的能力。两类模型在多类严重程度任务上都经历了类似的性能下降(超过30%的降幅),这表明复杂的临床理解挑战超出了模型规模的影响范围。对于SLMs而言,少量样本提示提供了显著的改进(高达14.6%),而LLM的改进则更加不稳定。我们的研究强调了SLMs在心理健康理解领域的潜力,证明它们可以成为分析敏感在线文本数据的有效隐私保护工具。特别是,它们通过少量样本学习快速适应和专业化的独特能力使其有希望成为可扩展的心理健康筛查工具。
https://arxiv.org/abs/2507.08031
In this paper, we, as the DS@GT team for CLEF 2025 CheckThat! Task 4a Scientific Web Discourse Detection, present the methods we explored for this task. For this multiclass classification task, we determined if a tweet contained a scientific claim, a reference to a scientific study or publication, and/or mentions of scientific entities, such as a university or a scientist. We present 3 modeling approaches for this task: transformer finetuning, few-shot prompting of LLMs, and a combined ensemble model whose design was informed by earlier experiments. Our team placed 7th in the competition, achieving a macro-averaged F1 score of 0.8611, an improvement over the DeBERTaV3 baseline of 0.8375. Our code is available on Github at this https URL.
在这篇论文中,作为CLEF 2025 CheckThat! Task 4a 科学网络话语检测任务的DS@GT团队,我们介绍了我们在该任务上探索的方法。对于这个多分类任务,我们判断一条推文是否包含科学声明、对科学研究或出版物的引用以及/或者提及了如大学或科学家等科学实体。我们为这项任务提出了三种建模方法:transformer微调、LLM的few-shot提示法和一个组合集成模型,该模型的设计参考了早期实验的结果。我们的团队在比赛中排名第七,取得了0.8611的宏平均F1分数,相较于DeBERTaV3基准线(0.8375)有了提升。我们的代码托管于Github,可在此 https URL 获取。
https://arxiv.org/abs/2507.06205
Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach we produce a synthetic dataset of over $5k$ Wikipedia abstracts with approximately $59k$ entities and $30k$ relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.
在文档级别的实体和关系抽取中,特别是在零样本或小样本设置下,大型高质量的标注语料库仍然非常稀缺。本文提出了一种基于大语言模型(LLM)的全自动数据生成及上下文学习管道,用于合成数据生成以及文档级实体与关系抽取任务中的零样本推理。 相较于现有方法依赖于手动注释示例或直接进行零样本推理,我们的方法结合了合成数据生成和检索增强型上下文学习技术,并使用经过优化以提高推理能力的语言模型。这样,我们能够在不需人工标注的情况下构建高质量的示范数据库,并在推断时动态地检索相关实例。 基于这一方法,我们生成了一个包含超过5000篇维基百科摘要的合成数据集,其中约有59,000个实体和30,000条关系三元组。最后,我们在DocIE共享任务上评估了上下文学习在零样本设置下从长文档中抽取实体和关系的表现。 我们发现,即使对于最先进的大型语言模型而言,在文档级别进行零样本联合实体与关系抽取仍然是一个具有挑战性的任务。
https://arxiv.org/abs/2507.05997
This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we got an F1-macro of 0.7546 (26/96 teams) for the English subset, 0.1727 (35/36 teams) for the Portuguese (Mozambican) subset and 0.325 (1/31 teams) for the Emakhuwa subset.
本文描述了Unibuc-NLP团队在应对SemEval 2025研讨会任务11:弥合基于文本的情感检测中的差距(Bridging the Gap in Text-Based Emotion Detection)时所采用的方法。我们主要集中在使用大型语言模型(Gemini、Qwen、DeepSeek)进行实验,这些实验要么采用少量样本提示法,要么进行微调。使用最终系统,在多标签情感检测赛道(赛道A)中,我们在英语子集上取得了0.7546的F1-macro评分(在96支队伍中排名第26),在葡萄牙语(莫桑比克)子集上的得分是0.1727(在36支队伍中排名第35),而在Emakhuwa语言子集上则获得了0.325的分数(在31支队伍中排名第1)。
https://arxiv.org/abs/2507.05918
The annotation bottleneck in semantic segmentation has driven significant interest in few-shot segmentation (FSS), which aims to develop segmentation models capable of generalizing rapidly to novel classes using minimal exemplars. Conventional training paradigms typically generate query prior maps by extracting masked-area features from support images, followed by making predictions guided by these prior maps. However, current approaches remain constrained by two critical limitations stemming from inter- and intra-image discrepancies, both of which significantly degrade segmentation performance: 1) The semantic gap between support and query images results in mismatched features and inaccurate prior maps; 2) Visually similar yet semantically distinct regions within support or query images lead to false negative or false positive predictions. We propose a novel FSS method called I$^2$R, with two components: 1) Category-specific high-level representations that aggregate global semantic cues from support and query images, enabling more precise inter-image region localization and addressing the first limitation. 2) A directional masking strategy that suppresses inconsistent support-query pixel pairs, which exhibit high feature similarity but conflicting masks, to mitigate the second issue. Experiments demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 1.9\% and 2.1\% in mIoU under the 1-shot setting on PASCAL-5$^i$ and COCO-20$^i$ benchmarks, respectively.
语义分割中的标注瓶颈已经激发了对少样本分割(Few-Shot Segmentation,FSS)的极大兴趣。少样本分割的目标是开发能够使用少量示例快速推广到新类别的分割模型。传统的训练范式通常通过从支持图像中提取掩码区域特征来生成查询先验图,然后根据这些先验图进行预测。然而,当前的方法仍然受到两个关键限制的影响:一是来自支持图像和查询图像之间的语义差距导致的不匹配特征及错误先验图;二是支持图像或查询图像内视觉相似但语义不同的区域会导致假阴性或假阳性预测。这些问题都严重降低了分割性能。 为了解决上述问题,我们提出了一种名为\textbf{I$^2$R}的新少样本分割方法: 1. 使用特定类别的高级表示形式来聚合来自支持图像和查询图像的全局语义线索,这有助于更精确地定位跨图像区域,并解决第一个限制。 2. 采用定向掩码策略抑制具有高特征相似性但标签冲突的支持-查询像素对,从而缓解第二个问题。 实验表明,我们的方法在PASCAL-5$^i$和COCO-20$^i$基准测试的1-shot设置下分别实现了mIoU指标上的改进:提高了1.9\% 和 2.1\%,超过了现有的最先进方法。
https://arxiv.org/abs/2507.05838