We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
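To make the idea concrete, here is a minimal sketch of what a counterfactual-training step could look like. The pairing of each factual batch with precomputed plausible, actionable counterfactuals (x_cf, y_cf) and the simple weighted-sum objective are illustrative assumptions, not the paper's exact formulation:

```python
import torch.nn.functional as F

def counterfactual_training_step(model, optimizer, x, y, x_cf, y_cf, lam=0.5):
    optimizer.zero_grad()
    # Standard task loss on the factual inputs.
    task_loss = F.cross_entropy(model(x), y)
    # Counterfactual term: the model should assign the desired label to each
    # plausible, actionable counterfactual, aligning its learned representations
    # with the explanations we want it to produce rather than relying on
    # post-hoc repair.
    cf_loss = F.cross_entropy(model(x_cf), y_cf)
    loss = task_loss + lam * cf_loss
    loss.backward()
    optimizer.step()
    return task_loss.item(), cf_loss.item()
```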
https://arxiv.org/abs/2601.16205
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we show that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by increasing the Gaussian robustness score defined on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining of the MLLMs. We demonstrate that FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks demonstrate the effectiveness of FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90\% to about 1\%.
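A minimal sketch of the smoothed-encoder construction, with illustrative values for the noise scale and sample count (the certified FCSB itself comes from the paper's theory; this only shows the Monte Carlo smoothing and its empirical clean-vs-adversarial cosine similarity):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def smoothed_encode(encoder, x, sigma=0.25, n=64):
    # Monte Carlo estimate of the FS-smoothed feature:
    #   g(x) = E_{d ~ N(0, sigma^2 I)}[ f(x + d) ], averaged over n draws.
    # sigma and n are illustrative, not the paper's settings.
    feats = [F.normalize(encoder(x + sigma * torch.randn_like(x)), dim=-1)
             for _ in range(n)]
    return F.normalize(torch.stack(feats).mean(dim=0), dim=-1)

@torch.no_grad()
def clean_vs_adv_similarity(encoder, x, x_adv):
    # Empirical counterpart of the certified bound: cosine similarity between
    # smoothed clean and smoothed adversarial features (the FCSB lower-bounds
    # this quantity under l2-bounded attacks, per the paper's theory).
    return F.cosine_similarity(smoothed_encode(encoder, x),
                               smoothed_encode(encoder, x_adv), dim=-1)
```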
https://arxiv.org/abs/2601.16200
X-ray dark-field radiography provides complementary diagnostic information to conventional attenuation imaging by visualizing microstructural tissue changes through small-angle scattering. However, the limited availability of such data poses challenges for developing robust deep learning models. In this work, we present the first framework for generating dark-field images directly from standard attenuation chest X-rays using an Uncertainty-Guided Progressive Generative Adversarial Network. The model incorporates both aleatoric and epistemic uncertainty to improve interpretability and reliability. Experiments demonstrate high structural fidelity of the generated images, with consistent improvement of quantitative metrics across stages. Furthermore, out-of-distribution evaluation confirms that the proposed model generalizes well. Our results indicate that uncertainty-guided generative modeling enables realistic dark-field image synthesis and provides a reliable foundation for future clinical applications.
https://arxiv.org/abs/2601.15859
With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to cluster a large-scale pre-trained codebook into smaller subsets, enabling dynamic adjustment of compression bitrates. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead by up to 5x and 2.5x, respectively. The code repository is available online.
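The codebook-clustering step can be sketched as follows; the codebook shape, target size, and use of scikit-learn's KMeans are illustrative assumptions, not TinySense's exact pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_codebook(codebook, target_size):
    """Cluster a large pretrained VQGAN codebook (K x D array) into a smaller
    one of target_size centroids, a rough analogue of the dynamic bitrate
    adjustment above: smaller codebooks need fewer bits per token,
    bits_per_token = ceil(log2(target_size))."""
    km = KMeans(n_clusters=target_size, n_init=10, random_state=0)
    km.fit(codebook)
    small_codebook = km.cluster_centers_   # (target_size, D)
    # Map every original code index to its compressed cluster index so that
    # already-quantized CSI token streams can be re-encoded cheaply.
    index_map = km.predict(codebook)       # (K,)
    return small_codebook, index_map

# Example: shrink a 1024-entry, 256-dim codebook to 64 entries (6 bits/token).
cb = np.random.randn(1024, 256).astype(np.float32)
small_cb, idx_map = compress_codebook(cb, 64)
```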
https://arxiv.org/abs/2601.15838
Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the proposed seeded single-elimination scheme achieves advantage-estimation accuracy nearly equivalent to full pairwise comparison, which has O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
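A sketch of the seeded single-elimination idea, assuming a pairwise `judge(a, b)` callable (standing in for the process-aware rubric evaluator, returning True if a wins) and coarse seed scores; a bracket of N trajectories resolves with N-1 judge calls, i.e., O(N) comparisons:

```python
def single_elim_rank(trajectories, judge, seed_scores):
    """Seeded single-elimination tournament (sketch). Returns, for each
    trajectory, the number of rounds survived, usable as a relative
    advantage signal within the group. `judge` and `seed_scores` are
    assumptions for illustration."""
    order = sorted(range(len(trajectories)), key=lambda i: -seed_scores[i])
    rounds_survived = [0] * len(trajectories)
    while len(order) > 1:
        half = len(order) // 2
        nxt = []
        # Standard seeding: best remaining plays worst remaining, etc.
        for a, b in zip(order[:half], reversed(order[half:])):
            winner = a if judge(trajectories[a], trajectories[b]) else b
            rounds_survived[winner] += 1
            nxt.append(winner)
        if len(order) % 2 == 1:        # odd bracket: middle seed gets a bye
            bye = order[half]
            rounds_survived[bye] += 1
            nxt.append(bye)
        order = sorted(nxt, key=lambda i: -seed_scores[i])
    return rounds_survived
```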
https://arxiv.org/abs/2601.06487
Diffusion models have emerged as a powerful approach for multimodal motion planning in autonomous driving. However, their practical deployment is typically hindered by the inherent difficulty in enforcing vehicle dynamics and a critical reliance on accurate predictions of other agents, making them prone to safety issues under uncertain interactions. To address these limitations, we introduce DualShield, a planning and control framework that leverages Hamilton-Jacobi (HJ) reachability value functions in a dual capacity. First, the value functions act as proactive guidance, steering the diffusion denoising process towards safe and dynamically feasible regions. Second, they form a reactive safety shield using control barrier-value functions (CBVFs) to modify the executed actions and ensure safety. This dual mechanism preserves the rich exploration capabilities of diffusion models while providing principled safety assurance under uncertain and even adversarial interactions. Simulations in challenging unprotected U-turn scenarios demonstrate that DualShield significantly improves both safety and task efficiency compared to leading methods from different planning paradigms under uncertainty.
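A discrete-time sketch of the reactive shield, assuming access to a one-step dynamics model `step` and a precomputed value function `V` (the condition V(x_next) >= gamma * V(x), with V >= 0 on the safe set, stands in for the continuous-time CBVF decrease constraint; the candidate search is illustrative):

```python
import numpy as np

def cbvf_shield(x, u_plan, step, V, gamma=0.9, candidates=None):
    """Reactive safety shield (sketch). Executes the planner's action u_plan
    if it satisfies the discrete-time CBVF-style condition; otherwise falls
    back to the least-modified safe candidate."""
    def safe(u):
        return V(step(x, u)) >= gamma * V(x)

    if safe(u_plan):
        return u_plan
    if candidates is None:
        # Coarse search over deviations from the planned action.
        candidates = [u_plan + d for d in np.linspace(-1.0, 1.0, 21)]
    safe_set = [u for u in candidates if safe(u)]
    if not safe_set:
        return np.zeros_like(u_plan)  # assumed safe fallback (e.g., brake)
    return min(safe_set, key=lambda u: np.linalg.norm(u - u_plan))
```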
https://arxiv.org/abs/2601.15729
Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving highly competitive accuracies of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: this https URL.
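The channel-wise feature interpolation strategy can be sketched in a few lines; the per-channel convex mixing rule shown here is an assumption about the mechanism, not Cr-GAN's exact formulation:

```python
import torch

def channelwise_interpolate(f_a, f_b):
    # Channel-wise feature interpolation (sketch): draw an independent mixing
    # weight per channel and form a convex combination of two real samples'
    # feature maps, yielding a novel latent feature.
    # f_a, f_b: (B, C, H, W) feature maps from the representation branch.
    lam = torch.rand(1, f_a.size(1), 1, 1, device=f_a.device)
    return lam * f_a + (1 - lam) * f_b
```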
https://arxiv.org/abs/2601.15681
Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and diminished public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model's alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model's ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality across Health, Finance, and Law. The benchmark includes two difficulty levels to test LLMs' defensive capabilities across varying knowledge depths. We propose two automated methods for evaluating adversarial attack success and long-form factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates. Long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions. Results show that, after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies across domains, and the gap between difficulty levels narrows as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the model's factual output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.
https://arxiv.org/abs/2601.15511
The rapid evolution of Retrieval-Augmented Generation (RAG) toward multimodal, high-stakes enterprise applications has outpaced the development of domain-specific evaluation benchmarks. Existing datasets often rely on general-domain corpora or purely textual retrieval, failing to capture the complexity of specialized technical documents where information is inextricably multimodal and reasoning requires synthesizing disjoint evidence. We address this gap by introducing MiRAGE, a Multiagent framework for RAG systems Evaluation, which leverages a collaborative swarm of specialized agents to generate verified, domain-specific, multimodal, and multi-hop Question-Answer datasets. MiRAGE orchestrates a swarm of specialized agents: a recursive context optimization loop to aggregate scattered evidence, an adversarial verifier agent to guarantee factual grounding, and an agent that recognizes the expert persona and the relevant domain to mimic expert cognitive workflows. Extensive empirical evaluation across four distinct domains (regulations, finance, quantitative biology, and journalism) demonstrates that MiRAGE generates datasets with significantly higher reasoning complexity (>2.3 average hops) and factual faithfulness. Our ablation studies indicate that MiRAGE can be powered by LLMs if textual descriptions of the images are available; visual grounding remains a frontier. By automating the creation of gold-standard evaluation datasets that reflect the latent thematic structure of proprietary corpora, MiRAGE provides the necessary infrastructure to rigorously benchmark the next generation of information retrieval systems.
https://arxiv.org/abs/2601.15487
Biomedical research increasingly relies on integrating diverse data modalities, including gene expression profiles, medical images, and clinical metadata. While medical images and clinical metadata are routinely collected in clinical practice, gene expression data presents unique challenges for widespread research use, mainly due to stringent privacy regulations and costly laboratory experiments. To address these limitations, we present GeMM-GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles. GeMM-GAN combines a Transformer Encoder for image patches with a final Cross Attention mechanism between patches and text tokens, producing a conditioning vector to guide a generative model in generating biologically coherent gene expression profiles. We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles, improving by more than 11\% the accuracy on downstream disease type prediction compared to current state-of-the-art generative models. Code will be available at: this https URL
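A minimal sketch of the conditioning path described above, with illustrative dimensions (the paper's exact architecture, tokenization, and pooling may differ):

```python
import torch
import torch.nn as nn

class PatchTextConditioner(nn.Module):
    """Sketch of the GeMM-GAN conditioning path: a Transformer encoder over
    histopathology patch embeddings, followed by cross-attention in which
    text (metadata) tokens query the patch tokens; the pooled output is the
    conditioning vector fed to the generator."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.patch_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_emb, text_emb):
        # patch_emb: (B, N_patches, dim); text_emb: (B, N_tokens, dim)
        patches = self.patch_encoder(patch_emb)
        attended, _ = self.cross_attn(query=text_emb, key=patches, value=patches)
        return attended.mean(dim=1)  # (B, dim) conditioning vector
```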
https://arxiv.org/abs/2601.15392
Misinformation and fake news have become a pressing societal challenge, driving the need for reliable automated detection methods. Prior research has highlighted sentiment as an important signal in fake news detection, either by analyzing which sentiments are associated with fake news or by using sentiment and emotion features for classification. However, this poses a vulnerability, since adversaries can manipulate sentiment to evade detectors, especially with the advent of large language models (LLMs). A few studies have explored adversarial samples generated by LLMs, but they mainly focus on stylistic features such as the writing style of news publishers. Thus, the crucial vulnerability of sentiment manipulation remains largely unexplored. In this paper, we investigate the robustness of state-of-the-art fake news detectors under sentiment manipulation. We introduce AdSent, a sentiment-robust detection framework designed to ensure consistent veracity predictions across both original and sentiment-altered news articles. Specifically, we (1) propose controlled sentiment-based adversarial attacks using LLMs, and (2) analyze the impact of sentiment shifts on detection performance. We show that changing the sentiment heavily impacts the performance of fake news detection models, revealing a bias toward classifying neutral articles as real while non-neutral articles are often classified as fake. (3) We introduce a novel sentiment-agnostic training strategy that enhances robustness against such perturbations. Extensive experiments on three benchmark datasets demonstrate that AdSent significantly outperforms competitive baselines in both accuracy and robustness, while also generalizing effectively to unseen datasets and adversarial scenarios.
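One plausible form of the sentiment-agnostic training strategy is a consistency objective over paired original and sentiment-altered articles; the symmetric-KL term below is an assumption for illustration, not AdSent's exact loss:

```python
import torch.nn.functional as F

def sentiment_agnostic_loss(model, x, x_shifted, y, lam=1.0):
    """Sketch: the detector should (i) classify the original article correctly
    and (ii) give the same veracity prediction for its LLM-generated,
    sentiment-altered version. x / x_shifted are batched encodings of the
    paired articles (pairing assumed to exist)."""
    logits = model(x)
    logits_shift = model(x_shifted)
    ce = F.cross_entropy(logits, y)
    p = F.log_softmax(logits, dim=-1)
    q = F.log_softmax(logits_shift, dim=-1)
    # Symmetric KL between the two predictive distributions.
    consistency = 0.5 * (
        F.kl_div(q, p.exp(), reduction="batchmean")
        + F.kl_div(p, q.exp(), reduction="batchmean")
    )
    return ce + lam * consistency
```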
https://arxiv.org/abs/2601.15277
Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a "compositional bridge", enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations under option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.
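A toy sketch of a path-derived reward, assuming fact triples have already been extracted from the model's response by some upstream component; it credits stated intermediate axioms on the gold KG path rather than only the final answer:

```python
def path_reward(kg_edges, gold_path, response_facts):
    """Fraction of intermediate axioms (edges on the gold reasoning path)
    that the response actually states.
    kg_edges:       set of (head, relation, tail) triples in the KG
    gold_path:      ordered triples linking question to answer
    response_facts: triples extracted from the model response (assumed)"""
    valid = [t for t in gold_path if t in kg_edges]  # sanity-filter the path
    if not valid:
        return 0.0
    stated = set(response_facts)
    return sum(1 for t in valid if t in stated) / len(valid)

# Example: a 2-hop medical path; the response states only the first hop.
kg = {("drug_x", "inhibits", "enzyme_y"), ("enzyme_y", "regulates", "pathway_z")}
path = [("drug_x", "inhibits", "enzyme_y"), ("enzyme_y", "regulates", "pathway_z")]
print(path_reward(kg, path, [("drug_x", "inhibits", "enzyme_y")]))  # 0.5
```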
https://arxiv.org/abs/2601.15160
Promptable segmentation models such as SAM have established a powerful paradigm, enabling strong generalization to unseen objects and domains with minimal user input, including points, bounding boxes, and text prompts. Among these, bounding boxes stand out as particularly effective, often outperforming points while significantly reducing annotation costs. However, current training and evaluation protocols typically rely on synthetic prompts generated through simple heuristics, offering limited insight into real-world robustness. In this paper, we investigate the robustness of promptable segmentation models to natural variations in bounding box prompts. First, we conduct a controlled user study and collect thousands of real bounding box annotations. Our analysis reveals substantial variability in segmentation quality across users for the same model and instance, indicating that SAM-like models are highly sensitive to natural prompt noise. Then, since exhaustive testing of all possible user inputs is computationally prohibitive, we reformulate robustness evaluation as a white-box optimization problem over the bounding box prompt space. We introduce BREPS, a method for generating adversarial bounding boxes that minimize or maximize segmentation error while adhering to naturalness constraints. Finally, we benchmark state-of-the-art models across 10 datasets, spanning everyday scenes to medical imaging. Code - this https URL.
https://arxiv.org/abs/2601.15123
Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and lack of interpretability, while directly prompting large vision-language models often struggle to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with the objective description of video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at this https URL.
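The four stages can be sketched as a simple prompt chain over an assumed vision-language-model callable `vlm(frames, prompt) -> str`; the prompts are paraphrased placeholders, not the paper's wording:

```python
def mars_pipeline(video_frames, vlm):
    """Training-free multi-stage adversarial reasoning (sketch): objective
    description, pro-hate evidence, counter-evidence, then synthesis."""
    desc = vlm(video_frames, "Objectively describe the content of this video.")
    pro = vlm(video_frames,
              f"Description: {desc}\nList concrete evidence that this video "
              "could be hateful (targets, slurs, symbols, framing).")
    con = vlm(video_frames,
              f"Description: {desc}\nList plausible non-hateful readings "
              "(satire, news reporting, reclaimed usage, missing context).")
    verdict = vlm(video_frames,
                  f"Description: {desc}\nEvidence for hate: {pro}\n"
                  f"Counter-evidence: {con}\n"
                  "Weigh both sides and answer 'hateful' or 'not hateful', "
                  "followed by a short justification.")
    return verdict
```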
https://arxiv.org/abs/2601.15115
We expose a critical limitation in current approaches to machine unlearning in language models: despite the apparent success of unlearning algorithms, information about the forgotten data remains linearly decodable from internal representations. To systematically assess this discrepancy, we introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our analysis reveals that redundant information, shared across both models, constitutes residual knowledge that persists post-unlearning and correlates with susceptibility to known adversarial reconstruction attacks. Leveraging these insights, we propose a representation-based risk score that can guide abstention on sensitive inputs at inference time, providing a practical mechanism to mitigate privacy leakage. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
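The linear-decodability finding that motivates the audit can be checked with a simple probe; `reps_after` and the balanced forget/retain split are assumptions of this sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def residual_decodability(reps_after, membership_labels):
    """Train a linear probe on *post-unlearning* representations to predict
    whether an example belonged to the forget set. Accuracy well above
    chance indicates residual knowledge surviving unlearning.
    reps_after:        (N, D) array of hidden states after unlearning
    membership_labels: (N,) 0/1 forget-set indicator"""
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, reps_after, membership_labels, cv=5)
    return scores.mean()

# Chance level is 0.5 for a balanced split; a mean score of, say, 0.8 would
# indicate substantial linearly decodable residual knowledge.
```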
https://arxiv.org/abs/2601.15111
Existing segmentation models exhibit significant vulnerability to adversarial attacks. To improve robustness, adversarial training incorporates adversarial examples into model training. However, existing attack methods consider only global semantic information and ignore contextual semantic relationships within the samples, limiting the effectiveness of adversarial training. To address this issue, we propose EroSeg-AT, a vulnerability-aware adversarial training framework that leverages EroSeg to generate adversarial examples. EroSeg first selects sensitive pixels based on pixel-level confidence and then progressively propagates perturbations to higher-confidence pixels, effectively disrupting the semantic consistency of the samples. Experimental results show that, compared to existing methods, our approach significantly improves attack effectiveness and enhances model robustness under adversarial training.
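A rough sketch of the confidence-guided erosion attack, assuming a segmentation model that returns per-pixel logits of shape (B, K, H, W); the schedule by which the perturbed region grows from low- to higher-confidence pixels, and all step sizes, are illustrative:

```python
import torch

def erosion_attack(model, x, eps=8 / 255, alpha=2 / 255, steps=10, frac=0.2):
    """Progressively perturb pixels, starting from the least-confident
    fraction `frac` and expanding toward higher-confidence pixels."""
    x_adv = x.clone().detach()
    for t in range(steps):
        x_adv.requires_grad_(True)
        conf = model(x_adv).softmax(dim=1).max(dim=1).values  # (B, H, W)
        # Grow the attacked region from the least-confident pixels outward.
        grow = frac + (1 - frac) * t / steps
        thresh = torch.quantile(conf.flatten(1), grow, dim=1)
        region = (conf <= thresh.view(-1, 1, 1)).float()      # (B, H, W)
        # Ascend the negative log-confidence inside the region.
        loss = -(conf.clamp_min(1e-8).log() * region).mean()
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign() * region.unsqueeze(1)
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1).detach()
    return x_adv
```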
https://arxiv.org/abs/2601.14950
Video summarization is a crucial technique for social understanding, enabling efficient browsing of massive multimedia content and extraction of key information from social platforms. Most existing unsupervised summarization methods rely on Generative Adversarial Networks (GANs) to enhance keyframe selection and generate coherent video summaries through adversarial training. However, such approaches primarily exploit unimodal features, overlooking the guiding role of semantic information in keyframe selection, and often suffer from unstable training. To address these limitations, we propose a novel Semantic-Guided Unsupervised Video Summarization method. Specifically, we design a novel frame-level semantic alignment attention mechanism and integrate it into a keyframe selector, which guides the Transformer-based generator within the adversarial framework to better reconstruct videos. In addition, we adopt an incremental training strategy to progressively update the model components, effectively mitigating the instability of GAN training. Experimental results demonstrate that our approach achieves superior performance on multiple benchmark datasets.
https://arxiv.org/abs/2601.14773
The rapid evolution of diffusion models has democratized face swapping but also raises concerns about privacy and identity security. Existing proactive defenses, often adapted from image editing attacks, prove ineffective in this context. We attribute this failure to their neglect of the structural resilience and the unique static conditional guidance mechanism inherent in face swapping systems. To address this, we propose VoidFace, a systemic defense method that views face swapping as a coupled identity pathway. By injecting perturbations at critical bottlenecks, VoidFace induces cascading disruption throughout the pipeline. Specifically, we first introduce localization disruption and identity erasure to degrade physical regression and semantic embeddings, thereby impairing the accurate modeling of the source face. We then intervene in the generative domain by decoupling attention mechanisms to sever identity injection, and corrupting intermediate diffusion features to prevent the reconstruction of source identity. To ensure visual imperceptibility, we perform adversarial search in the latent manifold, guided by a perceptual adaptive strategy to balance attack potency with image quality. Extensive experiments show that VoidFace outperforms existing defenses across various diffusion-based swapping models, while producing adversarial faces with superior visual quality.
https://arxiv.org/abs/2601.14738
Supervised deep learning models often achieve excellent performance within their training distribution but struggle to generalize beyond it. In cancer histopathology, for example, a convolutional neural network (CNN) may classify cancer severity accurately for cancer types represented in its training data, yet fail on related but unseen types. Although adenocarcinomas from different organs share morphological features that might support limited cross-domain generalization, addressing domain shift directly is necessary for robust performance. Domain adaptation offers a way to transfer knowledge from labeled data in one cancer type to unlabeled data in another, helping mitigate the scarcity of annotated medical images. This work evaluates cross-domain classification performance among lung, colon, breast, and kidney adenocarcinomas. A ResNet50 trained on any single adenocarcinoma achieves over 98% accuracy on its own domain but shows minimal generalization to others. Ensembling multiple supervised models does not resolve this limitation. In contrast, converting the ResNet50 into a domain adversarial neural network (DANN) substantially improves performance on unlabeled target domains. A DANN trained on labeled breast and colon data and adapted to unlabeled lung data reaches 95.56% accuracy. We also examine the impact of stain normalization on domain adaptation. Its effects vary by target domain: for lung, accuracy drops from 95.56% to 66.60%, while for breast and colon targets, stain normalization boosts accuracy from 49.22% to 81.29% and from 78.48% to 83.36%, respectively. Finally, using Integrated Gradients reveals that DANNs consistently attribute importance to biologically meaningful regions such as densely packed nuclei, indicating that the model learns clinically relevant features and can apply them to unlabeled cancer types.
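For reference, the gradient-reversal construction at the core of a DANN looks like this; the feature dimension and head shapes are assumptions matching a ResNet50-style backbone, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass, negated
    (scaled) gradient on the backward pass — the standard DANN trick."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DANNHead(nn.Module):
    """Sketch: label classifier plus adversarial domain classifier on shared
    features (e.g., the 2048-d ResNet50 pooled output)."""
    def __init__(self, feat_dim=2048, n_classes=3, n_domains=2):
        super().__init__()
        self.label_head = nn.Linear(feat_dim, n_classes)
        self.domain_head = nn.Linear(feat_dim, n_domains)

    def forward(self, feats, lam=1.0):
        y_logits = self.label_head(feats)
        d_logits = self.domain_head(GradReverse.apply(feats, lam))
        return y_logits, d_logits
```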
https://arxiv.org/abs/2601.14678
Adversarial attacks are widely used to evaluate model robustness, yet their validity as proxies for robustness to random perturbations remains debated. We ask whether an adversarial perturbation provides a representative estimate of robustness under random noise of the same magnitude, or instead reflects an atypical worst-case event. To this end, we introduce a probabilistic metric that quantifies noisy risk with respect to directionally biased perturbation distributions, parameterized by a concentration factor $\kappa$ that interpolates between isotropic noise and adversarial direction. Using this framework, we study the limits of adversarial perturbations as estimators of noisy risk by proposing an attack strategy designed to operate in regimes statistically closer to uniform noise. Experiments on ImageNet and CIFAR-10 systematically benchmark widely used attacks, highlighting when adversarial success meaningfully reflects noisy risk and when it fails, thereby informing their use in safety-oriented evaluation.
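A sketch of the directionally biased perturbation family and the Monte Carlo noisy-risk estimate; the convex direction-mixing rule parameterized by kappa is an illustrative stand-in for the paper's distribution family:

```python
import torch

def biased_noise(adv_dir, eps, kappa, n):
    """Sample n perturbations of l2 norm eps whose direction interpolates
    between isotropic noise (kappa = 0) and the adversarial direction
    (kappa = 1)."""
    d = (adv_dir / adv_dir.norm()).flatten()
    iso = torch.randn(n, d.numel())
    iso = iso / iso.norm(dim=1, keepdim=True)
    mix = kappa * d + (1 - kappa) * iso
    mix = mix / mix.norm(dim=1, keepdim=True)
    return eps * mix.view(n, *adv_dir.shape)

@torch.no_grad()
def noisy_risk(model, x, y, adv_dir, eps, kappa, n=256):
    """Monte Carlo estimate of misclassification probability under the
    biased perturbation distribution at magnitude eps, for one input x."""
    deltas = biased_noise(adv_dir, eps, kappa, n)
    preds = model(x.unsqueeze(0) + deltas).argmax(dim=1)
    return (preds != y).float().mean().item()
```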
https://arxiv.org/abs/2601.14519