Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at this https URL.
https://arxiv.org/abs/2505.16985
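The abstract above does not spell out the mixing rule, but the general idea of multimodal outlier synthesis can be sketched in a few lines. The channel-swap rule, feature shapes, and modality names below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def feature_mixing(feat_a, feat_b, ratio=0.5, rng=None):
    """Synthesize pseudo-outlier features by swapping a random subset of
    channels between the features of two modalities (hypothetical sketch,
    not the paper's exact mixing rule)."""
    rng = rng or np.random.default_rng(0)
    n_channels = feat_a.shape[-1]
    # pick a random fraction of channels to take from the other modality
    idx = rng.choice(n_channels, size=int(ratio * n_channels), replace=False)
    mixed = feat_a.copy()
    mixed[..., idx] = feat_b[..., idx]  # cross-modal channel swap
    return mixed

# Toy in-distribution features from an image branch and a LiDAR branch.
rng = np.random.default_rng(0)
img_feat = rng.normal(0.0, 1.0, size=(4, 16))
lidar_feat = rng.normal(3.0, 1.0, size=(4, 16))
outliers = feature_mixing(img_feat, lidar_feat, ratio=0.5, rng=rng)
```

Because no extra forward passes or generative models are involved, such a mixing step costs essentially one array copy per batch, which is consistent with the reported speedups over heavier outlier-synthesis schemes.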
The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.
https://arxiv.org/abs/2505.16934
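As a toy illustration of how a prompt-only watermark might be verified, the sketch below scores a text for over-representation of a "marked" word list that a prompt could instruct the model to favor. The word list, null rate, and z-test are invented for illustration; they are not the paper's four ICW strategies or detectors:

```python
import math

def marked_word_zscore(text, marked_words, p_null=0.05):
    """Toy lexical watermark detector: z-score for how over-represented the
    prompt-specified 'marked' words are, against a null rate p_null at which
    ordinary text would use them (all parameters are assumptions)."""
    tokens = [t.strip(".,") for t in text.lower().split()]
    n = len(tokens)
    hits = sum(t in marked_words for t in tokens)
    return (hits - p_null * n) / math.sqrt(n * p_null * (1 - p_null))

marked = {"notably", "whilst", "moreover", "hence"}
watermarked = "Moreover the method is sound and hence notably robust whilst simple."
plain = "The cat sat on the mat and looked out the window."
z = marked_word_zscore(watermarked, marked)
z_plain = marked_word_zscore(plain, marked)
```

A large positive z-score flags likely watermarked text; unwatermarked text stays near or below zero. Crucially, such detection needs only the output text, matching the peer-review scenario where the decoding process is inaccessible.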
Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
https://arxiv.org/abs/2505.16931
Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
https://arxiv.org/abs/2505.16894
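The covert drift quantities named in the abstract (cosine, JS, Spearman) are standard measures; a minimal NumPy rendering, with vector and distribution shapes assumed for illustration, might look like:

```python
import numpy as np

def cosine_drift(h0, h1):
    """1 - cosine similarity between two hidden-state vectors."""
    return 1.0 - float(h0 @ h1) / (np.linalg.norm(h0) * np.linalg.norm(h1))

def js_drift(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two attention distributions."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def spearman_corr(x, y):
    """Spearman rank correlation; its change across injection rounds is
    the 'Spearman drift' tracked in the study."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

h = np.array([0.2, -1.0, 0.5, 3.0])      # toy hidden state
p = np.array([0.7, 0.2, 0.1])            # toy attention distribution
```

In the paper's setup, these would be evaluated between the pre-injection and post-injection hidden states and attention maps at each titration round.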
The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. In addition, we further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
https://arxiv.org/abs/2505.16836
Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we develop a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks.
https://arxiv.org/abs/2505.16811
Jailbreak attacks pose a serious threat to large language models (LLMs) by bypassing built-in safety mechanisms and leading to harmful outputs. Studying these attacks is crucial for identifying vulnerabilities and improving model security. This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth. We find that existing attacks struggle to simultaneously achieve toxic stealth (concealing toxic content) and linguistic stealth (maintaining linguistic naturalness). Motivated by this, we propose StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text. The attack then prompts the LLM to extract the hidden query and respond in an encrypted manner. This approach effectively hides malicious intent while preserving naturalness, allowing it to evade both built-in and external safety mechanisms. We evaluate StegoAttack on four safety-aligned LLMs from major providers, benchmarking against eight state-of-the-art methods. StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%. Its ASR drops by less than 1% even under external detection (e.g., Llama Guard). Moreover, it attains the optimal comprehensive scores on stealth detection metrics, demonstrating both high efficacy and exceptional stealth capabilities. The code is available at this https URL
https://arxiv.org/abs/2505.16765
We explore the use of conformal prediction to provide statistical uncertainty guarantees for runway detection in vision-based landing systems (VLS). Using fine-tuned YOLOv5 and YOLOv6 models on aerial imagery, we apply conformal prediction to quantify localization reliability under user-defined risk levels. We also introduce Conformal mean Average Precision (C-mAP), a novel metric aligning object detection performance with conformal guarantees. Our results show that conformal prediction can improve the reliability of runway detection by quantifying uncertainty in a statistically sound way, increasing on-board safety and paving the way for certification of ML systems in the aerospace domain.
https://arxiv.org/abs/2505.16740
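Split conformal prediction itself is standard; a minimal sketch of the calibration step is below. The choice of nonconformity score (e.g. 1 - IoU between a predicted and true runway box) is our assumption, since the abstract does not specify it:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: the k-th smallest calibration nonconformity
    score with k = ceil((n + 1) * (1 - alpha)). A fresh test prediction's
    score falls below this threshold with probability >= 1 - alpha."""
    s = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(s)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float(s[min(k, n) - 1])

# Hypothetical calibration scores, e.g. 1 - IoU of detected runway boxes.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.02, 0.08, 0.12, 0.25, 0.18]
tau = conformal_threshold(cal_scores, alpha=0.2)
```

At test time, detections whose nonconformity stays below `tau` carry the user-chosen coverage guarantee, which is the statistically sound uncertainty quantification the abstract refers to.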
The Earth's surface is subject to complex and dynamic processes, ranging from large-scale phenomena such as tectonic plate movements to localized changes associated with ecosystems, agriculture, or human activity. Satellite images enable global monitoring of these processes with extensive spatial and temporal coverage, offering advantages over in-situ methods. In particular, resulting satellite image time series (SITS) datasets contain valuable information. To handle their large volume and complexity, some recent works focus on the use of graph-based techniques that abandon the regular Euclidean structure of satellite data to work at an object level. Besides, graphs enable modelling spatial and temporal interactions between identified objects, which are crucial for pattern detection, classification and regression tasks. This paper is an effort to examine the integration of graph-based methods in spatio-temporal remote-sensing analysis. In particular, it aims to present a versatile graph-based pipeline to tackle SITS analysis. It focuses on the construction of spatio-temporal graphs from SITS and their application to downstream tasks. The paper includes a comprehensive review and two case studies, which highlight the potential of graph-based approaches for land cover mapping and water resource forecasting. It also discusses numerous perspectives to resolve current limitations and encourage future developments.
https://arxiv.org/abs/2505.16685
Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies, due to safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and text-based prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2), analyzing their robustness to prompt variations, repeated trials, and qualitative outputs. Despite the lack of fine-tuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models that are trained on battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.
https://arxiv.org/abs/2505.16674
Medical anomaly detection (AD) is crucial for early clinical intervention, yet it faces challenges due to limited access to high-quality medical imaging data, caused by privacy concerns and data silos. Few-shot learning has emerged as a promising approach to alleviate these limitations by leveraging the large-scale prior knowledge embedded in vision-language models (VLMs). Recent advancements in few-shot medical AD have treated normal and abnormal cases as a one-class classification problem, often overlooking the distinction among multiple anomaly categories. Thus, in this paper, we propose a framework tailored for few-shot medical anomaly detection in the scenario where the identification of multiple anomaly categories is required. To capture the detailed radiological signs of medical anomaly categories, our framework incorporates diverse textual descriptions for each category generated by a Large-Language model, under the assumption that different anomalies in medical images may share common radiological signs in each category. Specifically, we introduce SD-MAD, a two-stage Sign-Driven few-shot Multi-Anomaly Detection framework: (i) Radiological signs are aligned with anomaly categories by amplifying inter-anomaly discrepancy; (ii) Aligned signs are selected further to mitigate the effect of the under-fitting and uncertain-sample issue caused by limited medical data, employing an automatic sign selection strategy at inference. Moreover, we propose three protocols to comprehensively quantify the performance of multi-anomaly detection. Extensive experiments illustrate the effectiveness of our method.
https://arxiv.org/abs/2505.16659
The recent increase in the number of connected devices has made promptly detecting security issues essential. Moreover, the high number of communication flows entails processing huge amounts of data. Furthermore, connected devices are heterogeneous in nature, with different computational capacities. For this reason, in this work we propose an image-based representation of network traffic that provides a compact summary of current network conditions over 1-second time windows. The proposed representation highlights the presence of anomalies, thus reducing the need for complex processing architectures. Finally, we present an unsupervised learning approach that effectively detects anomalies. The code and the dataset are available at this https URL.
https://arxiv.org/abs/2505.16650
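One plausible reading of the image-based representation, not necessarily the paper's actual encoding, is a per-window 2D histogram of flows: each 1-second window becomes a small fixed-size image regardless of traffic volume, which is what makes the summary compact. The bucketing by source and destination ID below is an illustrative assumption:

```python
import numpy as np

def traffic_window_image(flows, bins=16):
    """Aggregate one 1-second window of (src, dst) flow identifiers into a
    small 2D histogram 'image': rows bucket sources, columns bucket
    destinations. Axes and encoding are assumptions for illustration."""
    img = np.zeros((bins, bins), dtype=np.int32)
    for src, dst in flows:
        img[src % bins, dst % bins] += 1
    return img

# Toy window: three flows from host 10, one from host 99.
window = [(10, 20), (10, 20), (10, 21), (99, 3)]
img = traffic_window_image(window, bins=16)
```

A burst of anomalous traffic (e.g. a scan fanning out from one source) would light up an entire row of such an image, which is the kind of visually salient pattern an unsupervised detector can pick up without heavy processing.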
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at this https URL.
https://arxiv.org/abs/2505.16647
Media framing refers to the emphasis on specific aspects of perceived reality to shape how an issue is defined and understood. Its primary purpose is to shape public perceptions, often in alignment with the authors' opinions and stances. However, the interaction between stance and media frame remains largely unexplored. In this work, we apply an interdisciplinary approach to conceptualize and computationally explore this interaction with internet memes on climate change. We curate CLIMATEMEMES, the first dataset of climate-change memes annotated with both stance and media frames, inspired by research in communication science. CLIMATEMEMES includes 1,184 memes sourced from 47 subreddits, enabling analysis of frame prominence over time and communities, and sheds light on the framing preferences of different stance holders. We propose two meme understanding tasks: stance detection and media frame detection. We evaluate LLaVA-NeXT and Molmo in various setups, and report the corresponding results on their LLM backbones. Human captions consistently enhance performance. Synthetic captions and human-corrected OCR also help occasionally. Our findings highlight that VLMs perform well on stance, but struggle on frames, where LLMs outperform VLMs. Finally, we analyze VLMs' limitations in handling nuanced frames and stance expressions on climate change internet memes.
https://arxiv.org/abs/2505.16592
Maintaining robust 3D perception under dynamic and unpredictable test-time conditions remains a critical challenge for autonomous driving systems. Existing test-time adaptation (TTA) methods often fail in high-variance tasks like 3D object detection due to unstable optimization and sharp minima. While recent model merging strategies based on linear mode connectivity (LMC) offer improved stability by interpolating between fine-tuned checkpoints, they are computationally expensive, requiring repeated checkpoint access and multiple forward passes. In this paper, we introduce CodeMerge, a lightweight and scalable model merging framework that bypasses these limitations by operating in a compact latent space. Instead of loading full models, CodeMerge represents each checkpoint with a low-dimensional fingerprint derived from the source model's penultimate features and constructs a key-value codebook. We compute merging coefficients using ridge leverage scores on these fingerprints, enabling efficient model composition without compromising adaptation quality. Our method achieves strong performance across challenging benchmarks, improving end-to-end 3D detection by 14.9% NDS on nuScenes-C and LiDAR-based detection by over 7.6% mAP on nuScenes-to-KITTI, while benefiting downstream tasks such as online mapping, motion prediction, and planning even without training. Code and pretrained models are released in the supplementary material.
https://arxiv.org/abs/2505.16524
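Ridge leverage scores are a standard quantity; a sketch of how they could yield merging coefficients over checkpoint fingerprints follows. The fingerprint construction and the score-to-weight mapping are our assumptions, since the abstract does not give CodeMerge's exact formulas:

```python
import numpy as np

def ridge_leverage_scores(F, lam=1e-2):
    """Ridge leverage score of each fingerprint f_i (row i of F):
    l_i = f_i^T (F^T F + lam * I)^{-1} f_i."""
    d = F.shape[1]
    G_inv = np.linalg.inv(F.T @ F + lam * np.eye(d))
    return np.einsum("id,de,ie->i", F, G_inv, F)

def merging_coefficients(F, lam=1e-2):
    """Normalize leverage scores into convex merging weights (the exact
    mapping used by the paper is an assumption here)."""
    s = ridge_leverage_scores(F, lam)
    return s / s.sum()

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 8))   # 5 checkpoint fingerprints, 8-dim each
w = merging_coefficients(F)   # weights for averaging the checkpoints
```

Because only the low-dimensional fingerprint matrix is needed, the coefficients can be computed without loading any full checkpoint, which is the source of the claimed efficiency over interpolation-based merging.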
According to the EPA, only 25% of waste is recycled, and just 60% of U.S. municipalities offer curbside recycling. Plastics fare worse, with a recycling rate of only 8%; an additional 16% is incinerated, while the remaining 76% ends up in landfills. The low plastic recycling rate stems from contamination, poor economic incentives, and technical difficulties, making efficient recycling a challenge. To improve recovery, automated sorting plays a critical role. Companies like AMP Robotics and Greyparrot utilize optical systems for sorting, while Materials Recovery Facilities (MRFs) employ Near-Infrared (NIR) sensors to detect plastic types. Modern optical sorting uses advances in computer vision such as object recognition and instance segmentation, powered by machine learning. Two-stage detectors like Mask R-CNN use region proposals and classification with deep backbones like ResNet. Single-stage detectors like YOLO handle detection in one pass, trading some accuracy for speed. While such methods excel under ideal conditions with a large volume of labeled training data, challenges arise in realistic scenarios, emphasizing the need to further examine the efficacy of optical detection for automated sorting. In this study, we compiled novel datasets totaling 20,000+ images from varied sources. Using both public and custom machine learning pipelines, we assessed the capabilities and limitations of optical recognition for sorting. Grad-CAM, saliency maps, and confusion matrices were employed to interpret model behavior. We perform this analysis on models custom-trained on the compiled datasets. In conclusion, we find that optical recognition methods have limited success in accurately sorting real-world plastics at MRFs, primarily because they rely on physical properties such as color and shape.
https://arxiv.org/abs/2505.16513
In recent years, the rapid development of deepfake technology has given rise to an emerging and serious threat to public security: diffusion model-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency through multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models. Employing five of the latest digital human generation methods (Sonic, Hallo, etc.) and a voice cloning method, we systematically produce a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies show that the confusion rate between forged and real videos reaches 68%, and existing state-of-the-art (SOTA) detection models exhibit large drops in AUC values on DigiFakeAV, highlighting the challenge of the dataset. To address this problem, we further propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves SOTA performance on both the DigiFakeAV and DF-TIMIT datasets. Experiments show that this method effectively identifies covert artifacts through fine-grained analysis of the temporal evolution of facial features in synthetic videos.
https://arxiv.org/abs/2505.16512
The notion of relevance for the stability of the justification status of a single argument in incomplete argumentation frameworks (IAFs) was proposed by Odekerken et al. in 2024. To extend the notion, we study in this paper relevance for the stability of the verification status of a set of arguments, i.e., the uncertainties in an IAF that have to be resolved in some situations so that answering whether a given set of arguments is an extension yields the same result in every completion of the IAF. Further, we propose the notion of strong relevance to describe the necessity of resolution in all situations reaching stability. An analysis of complexity reveals that detecting the (strong) relevance for stability of sets of arguments can be accomplished in polynomial time under most of the semantics discussed in the paper. We also discuss the difficulty of finding tractable methods for relevance detection under grounded semantics.
https://arxiv.org/abs/2505.16507
Large Language Models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks through prompting techniques, including sentiment analysis. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in mid-layers for binary polarity tasks, with detection accuracy increasing up to 14% over prompting techniques. Additionally, we find that in decoder-only models, the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with potential to enhance model utility and reduce memory requirements.
https://arxiv.org/abs/2505.16491
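A linear probe over pooled hidden states is the standard tool for this kind of analysis; here is a self-contained NumPy sketch on synthetic "mid-layer" features (in the actual study these would be hidden states extracted from a Llama forward pass, and the probe, pooling, and data below are illustrative assumptions):

```python
import numpy as np

def train_probe(X, y, lr=0.5, steps=500):
    """Logistic-regression probe trained by gradient descent on (pooled)
    hidden states X with binary sentiment labels y."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        g = p - y                               # gradient of log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Toy stand-in for mid-layer features of positive vs. negative sentences.
rng = np.random.default_rng(0)
pos = rng.normal(+1.0, 1.0, size=(50, 8))
neg = rng.normal(-1.0, 1.0, size=(50, 8))
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(50), np.zeros(50)])

w, b = train_probe(X, y)
preds = (X @ w + b) > 0
acc = float((preds == (y == 1)).mean())
```

Running such probes layer by layer, and with different pooling choices, is how one locates where sentiment is most linearly decodable; the memory savings reported in the abstract come from truncating the forward pass at the probed layer instead of generating with the full model.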
We present a novel implicit neural shape optimization framework for 3D high-contrast Electrical Impedance Tomography (EIT), addressing scenarios where conductivity exhibits sharp discontinuities across material interfaces. These high-contrast cases, prevalent in metallic implant monitoring and industrial defect detection, challenge traditional reconstruction methods due to severe ill-posedness. Our approach synergizes shape optimization with implicit neural representations, introducing key innovations including a shape derivative-based optimization scheme that explicitly incorporates high-contrast interface conditions and an efficient latent space representation that reduces variable dimensionality. Through rigorous theoretical analysis of algorithm convergence and extensive numerical experiments, we demonstrate substantial performance improvements, establishing our framework as promising for practical applications in medical imaging with metallic implants and industrial non-destructive testing.
https://arxiv.org/abs/2505.16487