Inverse Reinforcement Learning (IRL) presents a powerful paradigm for learning complex robotic tasks from human demonstrations. However, most approaches assume that expert demonstrations are available, which is often not the case in practice. Those that allow for suboptimality in the demonstrations are not designed for long-horizon goals or adversarial tasks. Many desirable robot capabilities fall into one or both of these categories, highlighting a critical shortcoming in the ability of IRL to produce field-ready robotic agents. We introduce Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations (SPLASH), which extends the state-of-the-art in learning from suboptimal demonstrations to long-horizon and adversarial settings. We empirically validate SPLASH on a maritime capture-the-flag task in simulation and demonstrate real-world applicability with sim-to-real translation experiments on autonomous unmanned surface vehicles. We show that our proposed methods allow SPLASH to significantly outperform the state-of-the-art in reward learning from suboptimal demonstrations.
https://arxiv.org/abs/2507.08707
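A minimal sketch of the preference-based reward learning that SPLASH's name refers to, using a standard Bradley-Terry trajectory-ranking loss of the kind used in learning-from-suboptimal-demonstration work such as T-REX. The network size, observation dimension, and trajectory format are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a single state feature vector to a scalar reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def traj_return(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (T, obs_dim) -> predicted return (sum of per-step rewards)
        return self.net(traj).sum()

def preference_loss(reward_net, traj_worse, traj_better):
    """Bradley-Terry loss: the preferred trajectory should receive higher return."""
    r_w = reward_net.traj_return(traj_worse)
    r_b = reward_net.traj_return(traj_better)
    logits = torch.stack([r_w, r_b]).unsqueeze(0)      # shape (1, 2)
    # cross-entropy with the "better" trajectory as the target class
    return nn.functional.cross_entropy(logits, torch.tensor([1]))

# toy usage: two random trajectories of length 50 with 8-dim observations
obs_dim = 8
reward_net = RewardNet(obs_dim)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
worse, better = torch.randn(50, obs_dim), torch.randn(50, obs_dim)
loss = preference_loss(reward_net, worse, better)
opt.zero_grad(); loss.backward(); opt.step()
```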
Semantic segmentation relies on many dense pixel-wise annotations to achieve the best performance, but owing to the difficulty of obtaining accurate annotations for real-world data, practitioners train on large-scale synthetic datasets. Unpaired image translation is one method used to address the ensuing domain gap by generating more realistic training data in low-data regimes. Current methods for unpaired image translation train generative adversarial networks (GANs) to perform the translation and enforce pixel-level semantic matching through cycle consistency. These methods do not guarantee that the semantic matching holds, posing a problem for semantic segmentation where performance is sensitive to noisy pixel labels. We propose a novel image translation method, Domain Adversarial Kernel Prediction Network (DA-KPN), that guarantees semantic matching between the synthetic label and translation. DA-KPN estimates pixel-wise input transformation parameters of a lightweight and simple translation function. To ensure the pixel-wise transformation is realistic, DA-KPN uses multi-scale discriminators to distinguish between translated and target samples. We show that DA-KPN outperforms previous GAN-based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing.
https://arxiv.org/abs/2507.08554
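The abstract does not specify DA-KPN's translation function; as a rough illustration of "pixel-wise input transformation parameters of a lightweight and simple translation function", the sketch below assumes a per-pixel, per-channel affine transform whose scale and shift are predicted by a small convolutional network.

```python
import torch
import torch.nn as nn

class PixelwiseAffineTranslator(nn.Module):
    """Predicts per-pixel scale and shift for each channel and applies them.

    This is an assumed, simplified stand-in for a kernel-prediction-style
    translation function: the parameters vary spatially, while the operation
    applied at each pixel stays cheap and simple.
    """
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.param_net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2 * in_ch, 3, padding=1),  # scale + shift per channel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        params = self.param_net(x)
        scale, shift = params.chunk(2, dim=1)
        # softplus keeps the predicted scale positive
        return nn.functional.softplus(scale) * x + shift

# toy usage on a batch of synthetic 3-channel images
translator = PixelwiseAffineTranslator()
synthetic = torch.rand(4, 3, 64, 64)
translated = translator(synthetic)          # same shape as the input
assert translated.shape == synthetic.shape
```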
Deep hiding explores the hiding capability of deep learning-based models, aiming to conceal image-level messages in cover images and reveal them from the generated stego images. Existing schemes are easily detected by steganalyzers because of their large payloads, their reliance on feature extraction with either pure convolution or pure transformer operators within a single range, and their pixel-level loss constraints. To address these issues, we introduce generation-based adversarial attacks into color JPEG image deep hiding and propose MRAG, a multi-range representations-driven adversarial stego generation framework designed from a steganalysis perspective. Specifically, we integrate the local-range neighbor reception characteristic of convolution and the global-range dependency modeling of the transformer to construct MRAG. Meanwhile, we use transformed images obtained through coarse-grained and fine-grained frequency decomposition as inputs, introducing multi-grained information. Furthermore, a feature angle-norm disentanglement loss is designed to constrain the generated stegos to be closer to covers in the angle and norm space of the steganalyzer's classified features. Consequently, small yet effective adversarial perturbations can be injected into the stego generation process, ensuring that stegos retain favorable secret restorability and imperceptibility. Extensive experiments demonstrate that MRAG achieves state-of-the-art performance.
https://arxiv.org/abs/2507.08343
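The precise form of the feature angle-norm disentanglement loss is not given in the abstract; one minimal reading, assuming the loss separately penalises angular deviation and norm mismatch between the steganalyzer's features of stegos and covers, could look like this.

```python
import torch
import torch.nn.functional as F

def angle_norm_loss(stego_feat: torch.Tensor,
                    cover_feat: torch.Tensor,
                    w_angle: float = 1.0,
                    w_norm: float = 1.0) -> torch.Tensor:
    """Pull stego features toward cover features in angle and in norm.

    stego_feat, cover_feat: (batch, feat_dim) features taken from a
    steganalyzer's penultimate layer (an assumption for this sketch).
    """
    # angular term: 1 - cosine similarity, averaged over the batch
    angle_term = (1.0 - F.cosine_similarity(stego_feat, cover_feat, dim=1)).mean()
    # norm term: relative difference of feature magnitudes
    stego_norm = stego_feat.norm(dim=1)
    cover_norm = cover_feat.norm(dim=1)
    norm_term = ((stego_norm - cover_norm).abs() / (cover_norm + 1e-8)).mean()
    return w_angle * angle_term + w_norm * norm_term

# toy usage with random 128-dimensional features
loss = angle_norm_loss(torch.randn(8, 128), torch.randn(8, 128))
```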
With the advancement of vision-based autonomous driving technology, pedestrian detection has become an important component for improving traffic safety and driving system robustness. Nevertheless, in complex traffic scenarios, conventional pose estimation approaches frequently fail to accurately reconstruct occluded keypoints, primarily due to obstructions caused by vehicles, vegetation, or architectural elements. To address this issue, we propose a novel real-time occluded pedestrian pose completion framework termed Separation and Dimensionality Reduction-based Generative Adversarial Imputation Nets (SDR-GAIN). Unlike previous approaches that train visual models to distinguish occlusion patterns, SDR-GAIN aims to learn human pose directly from the numerical distribution of keypoint coordinates and interpolate missing positions. It employs a self-supervised adversarial learning paradigm to train lightweight generators with residual structures for the imputation of missing pose keypoints. Additionally, it integrates multiple pose standardization techniques to alleviate the difficulty of the learning process. Experiments conducted on the COCO and JAAD datasets demonstrate that SDR-GAIN surpasses conventional machine learning and Transformer-based missing data interpolation algorithms in accurately recovering occluded pedestrian keypoints, while simultaneously achieving microsecond-level real-time inference.
https://arxiv.org/abs/2306.03538
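SDR-GAIN's separation and dimensionality-reduction components are not detailed in the abstract; the sketch below only illustrates the underlying GAIN-style imputation idea it builds on: a generator fills in occluded keypoint coordinates while a discriminator predicts, per entry, whether the value was observed. The 17-keypoint flat layout, network sizes, and the omission of GAIN's hint mechanism are assumptions.

```python
import torch
import torch.nn as nn

N_KPTS = 17                       # COCO-style keypoints, (x, y) each
DIM = 2 * N_KPTS

def build_mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

# generator sees the observed coordinates plus the mask; the discriminator
# tries to predict, per entry, whether it was observed or imputed
generator = nn.Sequential(build_mlp(2 * DIM, DIM), nn.Sigmoid())
discriminator = build_mlp(DIM, DIM)

def impute(pose: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """pose: (B, DIM) normalised coordinates; mask: 1 = observed, 0 = occluded."""
    observed = pose * mask
    generated = generator(torch.cat([observed, mask], dim=1))
    # keep observed entries, fill occluded ones with generated values
    return observed + (1 - mask) * generated

# toy usage: random poses with roughly 30% of entries occluded
pose = torch.rand(8, DIM)
mask = (torch.rand(8, DIM) > 0.3).float()
completed = impute(pose, mask)

# discriminator loss on the completed pose (the generator objective would
# also include a reconstruction term on the observed entries)
d_logits = discriminator(completed)
d_loss = nn.functional.binary_cross_entropy_with_logits(d_logits, mask)
```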
Humanoid robots show significant potential in daily tasks. However, reinforcement learning-based motion policies often suffer from robustness degradation due to the sim-to-real dynamics gap, thereby affecting the agility of real robots. In this work, we propose a novel robust adversarial training paradigm designed to enhance the robustness of humanoid motion policies in the real world. The paradigm introduces a learnable adversarial attack network that precisely identifies vulnerabilities in motion policies and applies targeted perturbations, forcing the motion policy to enhance its robustness against perturbations through dynamic adversarial training. We conduct experiments on the Unitree G1 humanoid robot for both perceptive locomotion and whole-body control tasks. The results demonstrate that our proposed method significantly enhances the robot's motion robustness in real-world environments, enabling successful traversal of challenging terrains and highly agile whole-body trajectory tracking.
https://arxiv.org/abs/2507.08303
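The attack network and its training schedule are not described beyond the abstract; as a rough sketch of the idea, assume the adversary perturbs the policy's observations within a small bound and is updated to maximise the deviation of the resulting actions, with the motion policy later trained on those perturbed observations. The dimensions, perturbation bound, and losses below are placeholders.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, EPS = 48, 12, 0.05   # assumed sizes and perturbation bound

policy = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.Tanh(),
                       nn.Linear(128, ACT_DIM))
attacker = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.Tanh(),
                         nn.Linear(128, OBS_DIM), nn.Tanh())  # output in [-1, 1]
attack_opt = torch.optim.Adam(attacker.parameters(), lr=1e-4)

def attacker_step(obs_batch: torch.Tensor) -> torch.Tensor:
    """One adversary update: make the policy's action deviate as much as possible."""
    clean_action = policy(obs_batch).detach()
    perturbed_obs = obs_batch + EPS * attacker(obs_batch)
    perturbed_action = policy(perturbed_obs)
    # the attacker maximises the action deviation, so minimise its negative
    loss = -(perturbed_action - clean_action).pow(2).mean()
    attack_opt.zero_grad(); loss.backward(); attack_opt.step()
    return perturbed_obs.detach()

# toy usage: the perturbed observations would then feed the policy's own
# RL update, hardening it against the learned attack
perturbed = attacker_step(torch.randn(32, OBS_DIM))
```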
We introduce a lightweight yet highly effective safety guardrail framework for language models, demonstrating that small-scale language models can achieve, and even surpass, the performance of larger counterparts in content moderation tasks. This is accomplished through high-fidelity synthetic data generation and adversarial training. The synthetic data generation process begins with human-curated seed data, which undergoes query augmentation and paraphrasing to create diverse and contextually rich examples. This augmented data is then subjected to multiple rounds of curation, ensuring high fidelity and relevance. Inspired by recent advances in the Generative Adversarial Network (GAN) architecture, our adversarial training employs reinforcement learning to guide a generator that produces challenging synthetic examples. These examples are used to fine-tune the safety classifier, enhancing its ability to detect and mitigate harmful content. Additionally, we incorporate strategies from recent research on efficient LLM training, leveraging the capabilities of smaller models to improve the performance of larger generative models. With iterative adversarial training and the generation of diverse, high-quality synthetic data, our framework enables small language models (SLMs) to serve as robust safety guardrails. This approach not only reduces computational overhead but also enhances resilience against adversarial attacks, offering a scalable and efficient solution for content moderation in AI systems.
https://arxiv.org/abs/2507.08284
The emergence of autonomous Large Language Model (LLM) agents capable of tool usage has introduced new safety risks that go beyond traditional conversational misuse. These agents, empowered to execute external functions, are vulnerable to both user-initiated threats (e.g., adversarial prompts) and tool-initiated threats (e.g., malicious outputs from compromised tools). In this paper, we propose the first unified safety-alignment framework for tool-using agents, enabling models to handle both channels of threat via structured reasoning and sandboxed reinforcement learning. We introduce a tri-modal taxonomy (benign, malicious, and sensitive) for both user prompts and tool responses, and define a policy-driven decision model. Our framework employs a custom-designed sandbox environment that simulates real-world tool execution and allows fine-grained reward shaping. Through extensive evaluations on public and self-built benchmarks, including Agent SafetyBench, InjecAgent, and BFCL, we demonstrate that our safety-aligned agents significantly improve resistance to security threats while preserving strong utility on benign tasks. Our results show that safety and effectiveness can be jointly optimized, laying the groundwork for trustworthy deployment of autonomous LLM agents.
https://arxiv.org/abs/2507.08270
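The abstract does not spell out the policy-driven decision model; one simple way to picture the tri-modal taxonomy is as a lookup from the (user-prompt label, tool-response label) pair to an agent action. The action names and the mapping below are hypothetical, not the paper's policy.

```python
from enum import Enum

class Label(Enum):
    BENIGN = "benign"
    MALICIOUS = "malicious"
    SENSITIVE = "sensitive"

class Action(Enum):
    PROCEED = "proceed"             # execute the tool call / return the result
    REFUSE = "refuse"               # decline and explain
    SANITIZE = "sanitize_and_warn"  # redact or strip sensitive content first

# hypothetical decision table: (prompt label, tool-response label) -> action
POLICY = {
    (Label.BENIGN, Label.BENIGN): Action.PROCEED,
    (Label.BENIGN, Label.SENSITIVE): Action.SANITIZE,
    (Label.SENSITIVE, Label.BENIGN): Action.SANITIZE,
    (Label.SENSITIVE, Label.SENSITIVE): Action.SANITIZE,
}

def decide(prompt_label: Label, tool_label: Label) -> Action:
    # anything involving a malicious channel is refused by default
    if Label.MALICIOUS in (prompt_label, tool_label):
        return Action.REFUSE
    return POLICY.get((prompt_label, tool_label), Action.REFUSE)

assert decide(Label.BENIGN, Label.MALICIOUS) is Action.REFUSE
```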
As large language models (LLMs) are increasingly deployed in critical applications, the challenge of jailbreaking, where adversaries manipulate the models to bypass safety mechanisms, has become a significant concern. This paper presents a dynamic Stackelberg game framework to model the interactions between attackers and defenders in the context of LLM jailbreaking. The framework treats the prompt-response dynamics as a sequential extensive-form game, where the defender, as the leader, commits to a strategy while anticipating the attacker's optimal responses. We propose a novel agentic AI solution, the "Purple Agent," which integrates adversarial exploration and defensive strategies using Rapidly-exploring Random Trees (RRT). The Purple Agent actively simulates potential attack trajectories and intervenes proactively to prevent harmful outputs. This approach offers a principled method for analyzing adversarial dynamics and provides a foundation for mitigating the risk of jailbreaking.
https://arxiv.org/abs/2507.08207
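For readers unfamiliar with the setup, the defender-as-leader commitment corresponds to the standard bi-level Stackelberg formulation (generic notation, not the paper's):

$$ \min_{\pi_D} \; L\big(\pi_D, \pi_A^{*}(\pi_D)\big) \quad \text{s.t.} \quad \pi_A^{*}(\pi_D) \in \arg\max_{\pi_A} U_A(\pi_D, \pi_A), $$

where the defender (leader) commits to a strategy $\pi_D$, the attacker (follower) plays a best response $\pi_A^{*}(\pi_D)$ maximising its utility $U_A$, and $L$ is the defender's loss, e.g. the probability that a jailbreak elicits a harmful output.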
We propose Adaptive Diffusion Denoised Smoothing, a method for certifying the predictions of a vision model against adversarial examples, while adapting to the input. Our key insight is to reinterpret a guided denoising diffusion model as a long sequence of adaptive Gaussian Differentially Private (GDP) mechanisms refining a pure noise sample into an image. We show that these adaptive mechanisms can be composed through a GDP privacy filter to analyze the end-to-end robustness of the guided denoising process, yielding a provable certification that extends the adaptive randomized smoothing analysis. We demonstrate that our design, under a specific guiding strategy, can improve both certified accuracy and standard accuracy on ImageNet for an $\ell_2$ threat model.
https://arxiv.org/abs/2507.08163
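As background for the claim that the analysis "extends the adaptive randomized smoothing analysis": in standard (non-adaptive) randomized smoothing with Gaussian noise of scale $\sigma$, the smoothed classifier's prediction is certified within an $\ell_2$ radius (Cohen et al., 2019)

$$ R = \frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B})\right), $$

where $\underline{p_A}$ and $\overline{p_B}$ bound the probabilities of the top and runner-up classes under the noise and $\Phi^{-1}$ is the standard normal quantile function. The GDP composition in this paper is aimed at carrying a guarantee of this kind through a long sequence of input-adaptive denoising mechanisms rather than a single fixed-$\sigma$ mechanism.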
Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This calls for a unified framework that supports and streamlines such assessments for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
https://arxiv.org/abs/2507.07776
Adversarial Training (AT) is a widely adopted defense against adversarial examples. However, existing approaches typically apply a uniform training objective across all classes, overlooking disparities in class-wise vulnerability. This results in adversarial unfairness: classes with well-distinguishable features (strong classes) tend to become more robust, while classes with overlapping or shared features (weak classes) remain disproportionately susceptible to adversarial attacks. We observe that strong classes do not require strong adversaries during training, as their non-robust features are quickly suppressed. In contrast, weak classes benefit from stronger adversaries to effectively reduce their vulnerabilities. Motivated by this, we introduce TRIX, a feature-aware adversarial training framework that adaptively assigns weaker targeted adversaries to strong classes, promoting feature diversity via uniformly sampled targets, and stronger untargeted adversaries to weak classes, enhancing their focused robustness. TRIX further incorporates per-class loss weighting and perturbation strength adjustments, building on prior work, to emphasize weak classes during the optimization. Comprehensive experiments on standard image classification benchmarks, including evaluations under strong attacks such as PGD and AutoAttack, demonstrate that TRIX significantly improves worst-case class accuracy on both clean and adversarial data, reduces inter-class robustness disparities, and preserves overall accuracy. Our results highlight TRIX as a practical step toward fair and effective adversarial defense.
https://arxiv.org/abs/2507.07768
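The concrete schedules TRIX uses are not in the abstract; the fragment below only illustrates the mechanics it describes: per-class adversary strength (stronger untargeted PGD for weak classes, weaker targeted PGD toward uniformly sampled targets for strong classes) and per-class loss weights inside an otherwise standard adversarial-training step. The class split, $\epsilon$ values, and weights are placeholders.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, steps=5, targeted=False, target=None):
    """Standard L-inf PGD; with targeted=True the attack moves toward `target`."""
    alpha = 2.5 * eps / steps
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target if targeted else y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        step = -alpha * grad.sign() if targeted else alpha * grad.sign()
        x_adv = (x_adv + step).clamp(x - eps, x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

def trix_like_step(model, opt, x, y, weak_classes, n_classes,
                   eps_weak=8 / 255, eps_strong=4 / 255, w_weak=2.0):
    """One training step with class-dependent adversaries and loss weights.

    weak_classes: 1-D LongTensor listing the class indices treated as weak.
    """
    is_weak = torch.isin(y, weak_classes)
    x_adv = x.clone()
    if is_weak.any():        # weak classes: stronger, untargeted adversary
        x_adv[is_weak] = pgd(model, x[is_weak], y[is_weak], eps_weak)
    if (~is_weak).any():     # strong classes: weaker adversary, random target
        tgt = torch.randint(0, n_classes, y[~is_weak].shape)
        x_adv[~is_weak] = pgd(model, x[~is_weak], y[~is_weak], eps_strong,
                              targeted=True, target=tgt)
    weights = torch.ones_like(y, dtype=torch.float)
    weights[is_weak] = w_weak
    per_sample = F.cross_entropy(model(x_adv), y, reduction="none")
    loss = (weights * per_sample).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

In TRIX itself the adversary assignment and weighting schedules are learned or tuned as described in the paper; the fixed values above stand in for them only to show where they enter the update.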
Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the object-change objective, consistently manipulating a target object's classification across four downstream tasks, and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
https://arxiv.org/abs/2507.07709
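The "simultaneous misclassification across all tasks" success criterion can be made concrete with a small helper; the per-task predicates below are assumed simplifications, not the benchmark's actual checks.

```python
from typing import Callable, Dict, List

def cross_task_success_rate(
    examples: List[dict],
    task_predicates: Dict[str, Callable[[dict], bool]],
) -> float:
    """Fraction of adversarial examples that fool *every* task simultaneously.

    `examples` holds per-example model outputs for each task; each predicate
    returns True when that task's output reflects the intended object change.
    """
    if not examples:
        return 0.0
    hits = sum(
        all(pred(ex) for pred in task_predicates.values()) for ex in examples
    )
    return hits / len(examples)

# toy usage with two hypothetical tasks
preds = {
    "captioning": lambda ex: ex["caption_mentions_target_class"],
    "detection": lambda ex: ex["detected_class"] == ex["target_class"],
}
batch = [
    {"caption_mentions_target_class": True,
     "detected_class": "dog", "target_class": "dog"},
    {"caption_mentions_target_class": True,
     "detected_class": "cat", "target_class": "dog"},
]
print(cross_task_success_rate(batch, preds))  # 0.5
```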
With the increasing deployment of intelligent CCTV systems in outdoor environments, there is a growing demand for face recognition systems optimized for challenging weather conditions. Adverse weather significantly degrades image quality, which in turn reduces recognition accuracy. Although recent face image restoration (FIR) models based on generative adversarial networks (GANs) and diffusion models have shown progress, their performance remains limited due to the lack of dedicated modules that explicitly address weather-induced degradations. This leads to distorted facial textures and structures. To address these limitations, we propose a novel GAN-based blind FIR framework that integrates two key components: local Statistical Facial Feature Transformation (SFFT) and Degradation-Agnostic Feature Embedding (DAFE). The local SFFT module enhances facial structure and color fidelity by aligning the local statistical distributions of low-quality (LQ) facial regions with those of high-quality (HQ) counterparts. Complementarily, the DAFE module enables robust statistical facial feature extraction under adverse weather conditions by aligning LQ and HQ encoder representations, thereby making the restoration process adaptive to severe weather-induced degradations. Experimental results demonstrate that the proposed degradation-agnostic SFFT model outperforms existing state-of-the-art FIR methods based on GAN and diffusion models, particularly in suppressing texture distortions and accurately reconstructing facial structures. Furthermore, both the SFFT and DAFE modules are empirically shown to enhance structural fidelity and perceptual quality in face restoration under challenging weather scenarios.
https://arxiv.org/abs/2507.07464
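The abstract describes local SFFT as aligning local statistical distributions of LQ regions with HQ counterparts; a minimal moment-matching (AdaIN-style) version of that idea, renormalising each region's channel mean and standard deviation, is sketched below. The grid partitioning and the use of only first and second moments are assumptions.

```python
import torch

def local_stat_transfer(lq: torch.Tensor, hq: torch.Tensor,
                        grid: int = 4, eps: float = 1e-5) -> torch.Tensor:
    """Match per-region channel mean/std of `lq` to those of `hq`.

    lq, hq: (B, C, H, W) tensors; the image is split into a grid x grid
    set of non-overlapping regions and each region is renormalised.
    """
    b, c, h, w = lq.shape
    assert h % grid == 0 and w % grid == 0, "sketch assumes divisible sizes"
    out = lq.clone()
    rh, rw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            sl = (slice(None), slice(None),
                  slice(i * rh, (i + 1) * rh), slice(j * rw, (j + 1) * rw))
            lq_r, hq_r = lq[sl], hq[sl]
            lq_mu = lq_r.mean(dim=(2, 3), keepdim=True)
            lq_sd = lq_r.std(dim=(2, 3), keepdim=True) + eps
            hq_mu = hq_r.mean(dim=(2, 3), keepdim=True)
            hq_sd = hq_r.std(dim=(2, 3), keepdim=True) + eps
            out[sl] = (lq_r - lq_mu) / lq_sd * hq_sd + hq_mu
    return out

# toy usage on random "degraded" and "reference" face crops
restored_stats = local_stat_transfer(torch.rand(2, 3, 128, 128),
                                     torch.rand(2, 3, 128, 128))
```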
Critical infrastructure systems, including energy grids, healthcare facilities, transportation networks, and water distribution systems, are pivotal to societal stability and economic resilience. However, the increasing interconnectivity of these systems exposes them to various cyber threats, including ransomware, Denial-of-Service (DoS) attacks, and Advanced Persistent Threats (APTs). This paper examines cybersecurity vulnerabilities in critical infrastructure, highlighting the threat landscape, attack vectors, and the role of Artificial Intelligence (AI) in mitigating these risks. We propose a hybrid AI-driven cybersecurity framework to enhance real-time vulnerability detection, threat modelling, and automated remediation. This study also addresses the complexities of adversarial AI, regulatory compliance, and integration. Our findings provide actionable insights to strengthen the security and resilience of critical infrastructure systems against emerging cyber threats.
https://arxiv.org/abs/2507.07416
Phishing attacks are becoming increasingly sophisticated, underscoring the need for detection systems that strike a balance between high accuracy and computational efficiency. This paper presents a comparative evaluation of traditional Machine Learning (ML), Deep Learning (DL), and quantized small-parameter Large Language Models (LLMs) for phishing detection. Through experiments on a curated dataset, we show that while LLMs currently underperform compared to ML and DL methods in terms of raw accuracy, they exhibit strong potential for identifying subtle, context-based phishing cues. We also investigate the impact of zero-shot and few-shot prompting strategies, revealing that LLM-rephrased emails can significantly degrade the performance of both ML and LLM-based detectors. Our benchmarking highlights that models like DeepSeek R1 Distill Qwen 14B (Q8_0) achieve competitive accuracy, above 80%, using only 17GB of VRAM, supporting their viability for cost-efficient deployment. We further assess the models' adversarial robustness and cost-performance tradeoffs, and demonstrate how lightweight LLMs can provide concise, interpretable explanations to support real-time decision-making. These findings position optimized LLMs as promising components in phishing defence systems and offer a path forward for integrating explainable, efficient AI into modern cybersecurity frameworks.
https://arxiv.org/abs/2507.07406
With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.
https://arxiv.org/abs/2507.07341
Recent advances in large vision-language models have led to impressive performance in visual question answering and multimodal reasoning. However, it remains unclear whether these models genuinely perform grounded visual reasoning or rely on superficial patterns and dataset biases. In this work, we introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Our benchmark includes approximately 5,500 weakly supervised QA examples generated from strong model outputs and 900 human-curated examples with fine-grained annotations, including answers, rationales, and bounding box groundings. We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity, grounding fidelity, and self-correction ability. MagiC further includes diagnostic settings to probe model robustness under adversarial visual cues and assess their capacity for introspective error correction. We introduce new metrics such as MagiScore and StepSense, and provide comprehensive analyses that reveal key limitations and opportunities in current approaches to grounded visual reasoning.
https://arxiv.org/abs/2507.07297
As machine learning models become increasingly deployed across the edge of internet of things environments, a partitioned deep learning paradigm in which models are split across multiple computational nodes introduces a new dimension of security risk. Unlike traditional inference setups, these distributed pipelines span the model computation across heterogeneous nodes and communication layers, thereby exposing a broader attack surface to potential adversaries. Building on these motivations, this work explores a previously overlooked vulnerability: even when both the edge and cloud components of the model are inaccessible (i.e., black-box), an adversary who intercepts the intermediate features transmitted between them can still pose a serious threat. We demonstrate that, under these mild and realistic assumptions, an attacker can craft highly transferable proxy models, making the entire deep learning system significantly more vulnerable to evasion attacks. In particular, the intercepted features can be effectively analyzed and leveraged to distill surrogate models capable of crafting highly transferable adversarial examples against the target model. To this end, we propose an exploitation strategy specifically designed for distributed settings, which involves reconstructing the original tensor shape from vectorized transmitted features using simple statistical analysis, and adapting surrogate architectures accordingly to enable effective feature distillation. A comprehensive and systematic experimental evaluation has been conducted to demonstrate that surrogate models trained with the proposed strategy, i.e., leveraging intermediate features, tremendously improve the transferability of adversarial attacks. These findings underscore the urgent need to account for intermediate feature leakage in the design of secure distributed deep learning systems.
https://arxiv.org/abs/2507.07259
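The paper's shape-recovery and distillation procedure is its own; the sketch below illustrates the general recipe under explicit assumptions: the attacker can feed its own inputs to the edge device and capture the corresponding vectorised features, guesses a (C, H, W) layout consistent with the vector length, and trains a surrogate edge network by regressing onto the intercepted features.

```python
import torch
import torch.nn as nn

def guess_feature_shape(vec_len: int, channels=(256, 128, 64, 32)):
    """Pick a plausible (C, H, W) with H == W whose product equals vec_len."""
    for c in channels:
        if vec_len % c == 0:
            hw = vec_len // c
            side = int(hw ** 0.5)
            if side * side == hw:
                return c, side, side
    raise ValueError("no square spatial layout found for this length")

class SurrogateEdge(nn.Module):
    """Small CNN mapping attacker-chosen inputs to the intercepted-feature shape."""
    def __init__(self, out_shape):
        super().__init__()
        c, h, w = out_shape
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, c, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d((h, w)),
        )

    def forward(self, x):
        return self.backbone(x)

# toy usage: vectors of length 16384, whose true layout here is 64 x 16 x 16;
# several layouts may fit, which is why the paper resorts to statistical
# analysis of the transmitted features to disambiguate
shape = guess_feature_shape(16384, channels=(64, 32, 16))
surrogate = SurrogateEdge(shape)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

queries = torch.rand(8, 3, 128, 128)        # attacker-chosen inputs
intercepted = torch.randn(8, 16384)         # captured feature vectors
target = intercepted.view(8, *shape)

pred = surrogate(queries)
loss = nn.functional.mse_loss(pred, target)  # feature-distillation loss
opt.zero_grad(); loss.backward(); opt.step()
```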
Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
https://arxiv.org/abs/2507.06920
Cyber-attacks jeopardize the safe operation of smart microgrids. At the same time, existing diagnostic methods either depend on expensive multi-point instrumentation or rely on stringent modelling assumptions that are untenable under single-sensor constraints. This paper proposes a Fractional-Order Memory-Enhanced Attack-Diagnosis Scheme (FO-MADS) that achieves low-latency fault localisation and cyber-attack detection using only one VPQ (Voltage-Power-Reactive-power) sensor. FO-MADS first constructs a dual fractional-order feature library by jointly applying Caputo and Grünwald-Letnikov derivatives, thereby amplifying micro-perturbations and slow drifts in the VPQ signal. A two-stage hierarchical classifier then pinpoints the affected inverter and isolates the faulty IGBT switch, effectively alleviating class imbalance. Robustness is further strengthened through Progressive Memory-Replay Adversarial Training (PMR-AT), whose attack-aware loss is dynamically re-weighted via Online Hard Example Mining (OHEM) to prioritise the most challenging samples. Experiments on a four-inverter microgrid testbed comprising 1 normal and 24 fault classes under four attack scenarios demonstrate diagnostic accuracies of 96.6 % (bias), 94.0 % (noise), 92.8 % (data replacement), and 95.7 % (replay), while sustaining 96.7 % under attack-free conditions. These results establish FO-MADS as a cost-effective and readily deployable solution that markedly enhances the cyber-physical resilience of smart microgrids.
https://arxiv.org/abs/2507.06890
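The Caputo branch of the dual feature library is not easy to reproduce compactly, but the Grünwald-Letnikov side can be illustrated directly: the discrete GL fractional derivative of order $\alpha$ weights past samples with generalised binomial coefficients, which is what lets micro-perturbations and slow drifts in a VPQ-like signal stand out. The sampling step, order, and toy signal below are placeholders.

```python
import numpy as np

def gl_fractional_derivative(signal: np.ndarray, alpha: float, h: float = 1.0):
    """Grünwald-Letnikov fractional derivative of order `alpha`.

    Uses the standard recursive binomial weights
        w_0 = 1,  w_k = w_{k-1} * (1 - (alpha + 1) / k),
    so that D^alpha f[n] ~= h**(-alpha) * sum_k w_k * f[n - k].
    """
    n = len(signal)
    weights = np.empty(n)
    weights[0] = 1.0
    for k in range(1, n):
        weights[k] = weights[k - 1] * (1.0 - (alpha + 1.0) / k)
    out = np.empty(n)
    for i in range(n):
        # dot the weights with the most recent i+1 samples, newest first
        out[i] = np.dot(weights[: i + 1], signal[i::-1]) / h ** alpha
    return out

# toy usage: a half-order derivative of a slow drift plus a micro-perturbation,
# mimicking the kind of VPQ feature FO-MADS amplifies
t = np.linspace(0, 10, 500)
vpq_like = 0.01 * t + 0.002 * np.sin(40 * t)
feature = gl_fractional_derivative(vpq_like, alpha=0.5, h=t[1] - t[0])
```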