With the growing popularity of deep reinforcement learning (DRL), the human-in-the-loop (HITL) approach has the potential to revolutionize the way we approach decision-making problems and create new opportunities for human-AI collaboration. In this article, we introduce a novel multi-layered hierarchical HITL DRL algorithm that comprises three types of learning: self-learning, imitation learning, and transfer learning. In addition, we consider three forms of human input: reward, action, and demonstration. Furthermore, we discuss the main challenges, trade-offs, and advantages of HITL in solving complex problems and how human information can be integrated into the AI solution systematically. To verify our technical results, we present a real-world unmanned aerial vehicle (UAV) problem wherein a number of enemy drones attack a restricted area. The objective is to design a scalable HITL DRL algorithm for ally drones to neutralize the enemy drones before they reach the area. To this end, we first implement our solution using an award-winning open-source HITL software called Cogment. We then demonstrate several interesting results, such as (a) HITL leads to faster training and higher performance, (b) advice acts as a guiding direction for gradient methods and lowers variance, and (c) the amount of advice should be neither too large nor too small, to avoid over-training and under-training. Finally, we illustrate the role of human-AI cooperation in solving two real-world complex scenarios, i.e., overloaded and decoy attacks.
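As a rough illustration (not the paper's Cogment implementation), the sketch below folds the three human input forms listed above into a single policy-gradient loss: human reward shapes the return, and human action advice contributes an imitation term. All names and the weighting are illustrative.

```python
# Minimal sketch: blending human reward and action advice into a policy-gradient update.
import torch
import torch.nn.functional as F

def hitl_loss(logits, actions, returns, human_reward=None,
              human_actions=None, imitation_weight=0.5):
    """logits: (T, A) policy logits; actions: (T,) taken actions;
    returns: (T,) discounted returns; human_reward: optional (T,) shaping signal;
    human_actions: optional (T,) advice labels, -1 where no advice was given."""
    if human_reward is not None:
        returns = returns + human_reward                  # reward-style human feedback
    log_probs = F.log_softmax(logits, dim=-1)
    pg_loss = -(log_probs[torch.arange(len(actions)), actions] * returns).mean()

    im_loss = torch.tensor(0.0)
    if human_actions is not None:
        mask = human_actions >= 0                         # steps where advice exists
        if mask.any():                                    # imitate advised actions
            im_loss = F.cross_entropy(logits[mask], human_actions[mask])
    return pg_loss + imitation_weight * im_loss
```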
https://arxiv.org/abs/2504.17006
Large language models (LLMs) have achieved remarkable progress, demonstrating unprecedented capabilities across various natural language processing tasks. However, the high costs associated with such exceptional performance limit the widespread adoption of LLMs, highlighting the need for prompt compression. Existing prompt compression methods primarily rely on heuristic truncation or abstractive summarization techniques, which fundamentally overlook the intrinsic mechanisms of LLMs and lack a systematic evaluation of token importance for generation. In this work, we introduce Prompt Importance Sampling (PIS), a novel compression framework that dynamically compresses prompts by sampling important tokens based on the analysis of attention scores of hidden states. PIS employs a dual-level compression mechanism: 1) at the token level, we quantify saliency using LLM-native attention scores and implement adaptive compression through a lightweight 9-layer reinforcement learning (RL) network; 2) at the semantic level, we propose a Russian roulette sampling strategy for sentence-level importance sampling. Comprehensive evaluations across multiple domain benchmarks demonstrate that our method achieves state-of-the-art compression performance. Notably, our framework serendipitously enhances reasoning efficiency through optimized context structuring. This work advances prompt engineering by offering both theoretical grounding and practical efficiency in context management for LLMs.
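As a rough illustration of the two sampling levels described above (not the authors' PIS code), the sketch below scores tokens by the attention they receive and applies a Russian-roulette pass over sentences; the adaptive, RL-learned compression ratio is replaced by a fixed `keep_ratio`.

```python
# Illustrative token- and sentence-level importance sampling for prompt compression.
import numpy as np

def token_importance(attn):
    """attn: (heads, T, T) attention weights; return per-token saliency."""
    return attn.mean(axis=0).sum(axis=0)              # attention each token receives

def compress_tokens(tokens, attn, keep_ratio=0.5):
    scores = token_importance(attn)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])            # keep top-k, preserve order
    return [tokens[i] for i in keep]

def russian_roulette_sentences(sentences, scores, rng=np.random.default_rng(0)):
    """Drop each sentence with probability 1 - normalized importance."""
    p = np.asarray(scores, dtype=float)
    p = p / p.max()
    return [s for s, q in zip(sentences, p) if rng.random() < q]
```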
https://arxiv.org/abs/2504.16574
An ongoing research challenge across several domains of computer vision is how to increase model generalization capabilities. Several attempts to improve model generalization performance are heavily inspired by human perceptual intelligence, which is remarkable in both its performance and its efficiency in generalizing to unknown samples. Many of these methods attempt to force portions of the network to be orthogonal, following observations in neuroscience related to early vision processes. In this paper, we propose a loss component that regularizes the filtering kernels in the first convolutional layer of a network to make them nearly orthogonal. Deviating from previous works, we give the network flexibility in choosing which pairs of kernels it makes orthogonal, allowing it to navigate to a better solution space without imposing harsh penalties. Without architectural modifications, we report substantial gains in generalization performance over previous works (including orthogonalization- and saliency-based regularization methods) across three different architectures (ResNet-50, DenseNet-121, ViT-b-16) and two difficult open-set recognition tasks: presentation attack detection in iris biometrics, and anomaly detection in chest X-ray images.
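The sketch below is one hedged way to realize a "flexible" near-orthogonality penalty on first-layer kernels: only the most-aligned fraction of kernel pairs is penalized, leaving the network free to choose which pairs become orthogonal. The pair-selection rule and the `frac` value are assumptions, not the paper's exact loss.

```python
# Soft orthogonality regularizer over first-layer convolutional kernels.
import torch

def flexible_orthogonality_loss(conv_weight, frac=0.5):
    """conv_weight: (out_ch, in_ch, k, k) first-layer kernels."""
    w = conv_weight.flatten(1)
    w = torch.nn.functional.normalize(w, dim=1)
    gram = w @ w.t()                                   # pairwise cosine similarities
    n = gram.size(0)
    off = gram[~torch.eye(n, dtype=torch.bool)].abs()  # off-diagonal entries only
    k = max(1, int(frac * off.numel()))
    worst, _ = torch.topk(off, k)                      # penalize only the most-aligned pairs
    return (worst ** 2).mean()                         # near-orthogonality, not exact

# usage: loss = task_loss + lam * flexible_orthogonality_loss(model.conv1.weight)
```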
https://arxiv.org/abs/2504.16362
Control Flow Graphs (CFGs) are critical for analyzing program execution and characterizing malware behavior. With the growing adoption of Graph Neural Networks (GNNs), CFG-based representations have proven highly effective for malware detection. This study proposes a novel framework that dynamically constructs CFGs and embeds node features using a hybrid approach combining rule-based encoding and autoencoder-based embedding. A GNN-based classifier is then constructed to detect malicious behavior from the resulting graph representations. To improve model interpretability, we apply state-of-the-art explainability techniques, including GNNExplainer, PGExplainer, and CaptumExplainer, the latter with three attribution methods: Integrated Gradients, Guided Backpropagation, and Saliency. In addition, we introduce a novel aggregation method, called RankFusion, that integrates the outputs of the top-performing explainers to enhance explanation quality. We also evaluate explanations using two subgraph extraction strategies, including the proposed Greedy Edge-wise Composition (GEC) method for improved structural coherence. A comprehensive evaluation using accuracy, fidelity, and consistency metrics demonstrates the effectiveness of the proposed framework in accurately identifying malware samples and generating reliable, interpretable explanations.
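A minimal sketch of how rank-based fusion of explainer outputs and a greedy, connectivity-aware edge selection could look; both functions are illustrative interpretations of RankFusion and GEC, not the paper's implementation.

```python
# Illustrative rank fusion over explainer edge scores, plus greedy connected selection.
import numpy as np

def rank_fusion(score_lists):
    """score_lists: list of arrays, each giving one explainer's edge-importance scores."""
    ranks = [np.argsort(np.argsort(-s)) for s in score_lists]   # 0 = most important
    return np.mean(ranks, axis=0)                               # lower fused rank = more salient

def greedy_edgewise_subgraph(edges, fused_ranks, budget):
    """Greedily grow a connected edge set in fused-rank order -- a guess at how
    'Greedy Edge-wise Composition' keeps the explanation structurally coherent."""
    order = list(np.argsort(fused_ranks))
    chosen, nodes = [order[0]], set(edges[order[0]])
    while len(chosen) < budget:
        nxt = next((i for i in order if i not in chosen and
                    (edges[i][0] in nodes or edges[i][1] in nodes)), None)
        if nxt is None:
            break
        chosen.append(nxt)
        nodes.update(edges[nxt])
    return [edges[i] for i in chosen]
```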
https://arxiv.org/abs/2504.16316
Object classification models utilizing point cloud data are fundamental for 3D media understanding, yet they often struggle with unseen or out-of-distribution (OOD) scenarios. Existing point cloud unsupervised domain adaptation (UDA) methods typically employ a multi-task learning (MTL) framework that combines primary classification tasks with auxiliary self-supervision tasks to bridge the gap between cross-domain feature distributions. However, our further experiments demonstrate that not all gradients from self-supervision tasks are beneficial and that some may negatively impact classification performance. In this paper, we propose a novel solution, termed the Saliency Map-based Data Sampling Block (SM-DSB), to mitigate these gradient conflicts. Specifically, our method designs a new scoring mechanism based on the skewness of 3D saliency maps to estimate gradient conflicts without requiring target labels. Leveraging this, we develop a sample selection strategy that dynamically filters out samples whose self-supervision gradients are not beneficial for classification. Our approach is scalable, introduces only modest computational overhead, and can be integrated into any point cloud UDA MTL framework. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches. In addition, we provide a new perspective on understanding the UDA problem through back-propagation analysis.
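A small sketch of the scoring idea: the skewness of a per-sample 3D saliency map acts as a proxy for gradient conflict, and flagged samples skip the self-supervision gradient in the multi-task update. The threshold and sign convention here are placeholders, not the paper's calibration.

```python
# Skewness-based sample filtering for the self-supervision branch.
import numpy as np
from scipy.stats import skew

def saliency_skewness(saliency):
    """saliency: (N,) per-point saliency values for one point cloud."""
    return skew(saliency)

def select_samples(saliency_maps, threshold=0.0):
    """Return a boolean mask: True where the self-supervision gradient is kept."""
    scores = np.array([saliency_skewness(s) for s in saliency_maps])
    return scores > threshold
```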
https://arxiv.org/abs/2504.15796
Age prediction from medical images or other health-related non-imaging data is an important approach to data-driven aging research, providing knowledge of how much information a specific tissue or organ carries about the chronological age of the individual. In this work, we studied the prediction of age from computed tomography angiography (CTA) images, which provide detailed representations of the heart morphology, with the goals of (i) studying the relationship between morphology and aging, and (ii) developing a novel "morphological heart age" biomarker. We applied an image registration-based method that standardizes the images from the whole cohort into a single space. We then extracted supervoxels (using unsupervised segmentation) and corresponding robust features of density and local volume, which provide a detailed representation of the heart morphology while being robust to registration errors. Machine learning regression models were then trained to predict chronological age from these features. We applied the method to a subset of the images from the Swedish CArdioPulmonary bioImage Study (SCAPIS) dataset, consisting of 721 females and 666 males, and observed a mean absolute error of 2.74 years for females and 2.77 years for males. Predictions from different sub-regions of interest were more highly correlated with the whole-heart predictions than with chronological age, revealing a high consistency in the morphology-based predictions. Saliency analysis was also performed on the prediction models to study which regions are positively and negatively associated with the predicted age. This resulted in detailed association maps in which the density and volume of known, as well as some novel, sub-regions of interest are determined to be important. The saliency analysis aids the interpretability of the models and their predictions.
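A compact sketch of the final modeling stage under stated assumptions: per-supervoxel density and volume features (assumed already extracted) regressed onto chronological age, with a ridge regressor standing in for whichever regression model the authors used.

```python
# Regressing chronological age from supervoxel features with out-of-sample evaluation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def heart_age_model(X, age, alpha=1.0):
    """X: (n_subjects, n_supervoxel_features); age: (n_subjects,) chronological ages."""
    model = Ridge(alpha=alpha)
    pred = cross_val_predict(model, X, age, cv=5)      # cross-validated predictions
    mae = np.mean(np.abs(pred - age))                  # mean absolute error in years
    return model.fit(X, age), pred, mae
```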
https://arxiv.org/abs/2504.15783
Infrared dim and small target detection presents a significant challenge due to dynamic multi-frame scenarios and weak target signatures in the infrared modality. Traditional low-rank plus sparse models often fail to capture dynamic backgrounds and global spatial-temporal correlations, which results in background leakage or target loss. In this paper, we propose a novel motion-enhanced nonlocal similarity implicit neural representation (INR) framework to address these challenges. We first integrate motion estimation via optical flow to capture subtle target movements, and propose multi-frame fusion to enhance motion saliency. Second, we leverage nonlocal similarity to construct patch tensors with strong low-rank properties, and propose an innovative tensor decomposition-based INR model to represent the nonlocal patch tensor, effectively encoding both the nonlocal low-rankness and spatial-temporal correlations of background through continuous neural representations. An alternating direction method of multipliers is developed for the nonlocal INR model, which enjoys theoretical fixed-point convergence. Experimental results show that our approach robustly separates dim targets from complex infrared backgrounds, outperforming state-of-the-art methods in detection accuracy and robustness.
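For reference, the traditional low-rank plus sparse baseline mentioned above can be written as the standard infrared patch-tensor decomposition below (a generic formulation, not this paper's INR model): the background is low-rank, the target is sparse, and the fit is obtained by nuclear-norm plus L1 minimization.

```latex
% Classical low-rank plus sparse decomposition of the infrared patch tensor D
% into background B, target T, and noise N.
\begin{aligned}
\mathcal{D} &= \mathcal{B} + \mathcal{T} + \mathcal{N}, \\
\min_{\mathcal{B},\,\mathcal{T}} \;& \|\mathcal{B}\|_{*} + \lambda \|\mathcal{T}\|_{1}
\quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_{F} \le \varepsilon .
\end{aligned}
```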
https://arxiv.org/abs/2504.15665
Eye-tracking analysis plays a vital role in medical imaging, providing key insights into how radiologists visually interpret and diagnose clinical cases. In this work, we first analyze radiologists' attention and agreement by measuring the distribution of various eye-movement patterns, including saccade direction, amplitude, and their joint distribution. These metrics help uncover patterns in attention allocation and diagnostic strategies. Furthermore, we investigate whether and how doctors' gaze behavior shifts when viewing authentic (Real) versus deep-learning-generated (Fake) images. To achieve this, we examine fixation bias maps, focusing on first, last, short, and longest fixations independently, along with detailed saccade patterns, to quantify differences in gaze distribution and visual saliency between authentic and synthetic images.
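A small illustration of the saccade statistics discussed above, computed from an ordered sequence of fixation coordinates; units and bin counts are placeholders.

```python
# Saccade amplitude, direction, and their joint distribution from fixation coordinates.
import numpy as np

def saccade_stats(fix_x, fix_y):
    """fix_x, fix_y: fixation coordinates in temporal order (e.g., pixels)."""
    dx, dy = np.diff(fix_x), np.diff(fix_y)
    amplitude = np.hypot(dx, dy)                        # saccade length
    direction = np.degrees(np.arctan2(dy, dx))          # saccade angle in degrees
    joint, _, _ = np.histogram2d(direction, amplitude, bins=(36, 20))
    return amplitude, direction, joint                  # marginals plus joint histogram
```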
https://arxiv.org/abs/2504.15007
Thermal infrared (TIR) images typically lack detailed features and have low contrast, making it challenging for conventional feature extraction models to capture discriminative target characteristics. As a result, trackers are often affected by interference from visually similar objects and are susceptible to tracking drift. To address these challenges, we propose a novel saliency-guided Siamese network tracker based on key fine-grained feature information. First, we introduce a fine-grained feature parallel learning convolutional block with a dual-stream architecture and convolutional kernels of varying sizes. This design captures essential global features from shallow layers, enhances feature diversity, and minimizes the loss of fine-grained information typically encountered in residual connections. In addition, we propose a multi-layer fine-grained feature fusion module that uses bilinear matrix multiplication to effectively integrate features across both deep and shallow layers. Next, we introduce a Siamese residual refinement block that corrects saliency map prediction errors using residual learning. Combined with deep supervision, this mechanism progressively refines predictions, applying supervision at each recursive step to ensure consistent improvements in accuracy. Finally, we present a saliency loss function to constrain the saliency predictions, directing the network to focus on highly discriminative fine-grained features. Extensive experimental results demonstrate that the proposed tracker achieves the highest precision and success rates on the PTB-TIR and LSOTB-TIR benchmarks. It also achieves a top accuracy of 0.78 on the VOT-TIR 2015 benchmark and 0.75 on the VOT-TIR 2017 benchmark.
https://arxiv.org/abs/2504.14309
Video saliency prediction is crucial for downstream applications, such as video compression and human-computer interaction. With the flourishing of multimodal learning, researchers have started to explore multimodal video saliency prediction, including audio-visual and text-visual approaches. Auditory cues guide the gaze of viewers to sound sources, while textual cues provide semantic guidance for understanding video content. Integrating these complementary cues can improve the accuracy of saliency prediction. Therefore, we attempt to simultaneously analyze visual, auditory, and textual modalities in this paper, and propose TAVDiff, a Text-Audio-Visual-conditioned Diffusion Model for video saliency prediction. TAVDiff treats video saliency prediction as an image generation task conditioned on textual, audio, and visual inputs, and predicts saliency maps through stepwise denoising. To effectively utilize text, a large multimodal model is used to generate textual descriptions for video frames, and a saliency-oriented image-text response (SITR) mechanism is introduced to generate image-text response maps, which serve as conditional information guiding the model to localize visual regions that are semantically related to the textual description. The auditory modality serves as another condition, directing the model to focus on salient regions indicated by sounds. At the same time, the diffusion transformer (DiT) directly concatenates the conditional information with the timestep, which may affect the estimation of the noise level. To achieve effective conditional guidance, we propose Saliency-DiT, which decouples the conditional information from the timestep. Experimental results show that TAVDiff outperforms existing methods, improving by 1.03%, 2.35%, 2.71%, and 0.33% on the SIM, CC, NSS, and AUC-J metrics, respectively.
https://arxiv.org/abs/2504.14267
Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) the architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage; 2) the paradigm globally updates all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task that addresses these two key limitations. It introduces, for the first time, a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with minimal tunable parameters. Concretely, the primary insight of the purposeful Consensus Prompt Generator (CPG) is to enforce limited tunable parameters to focus on co-salient representations and generate consensus prompts. The formulated Consensus Prompt Disperser (CPD) leverages consensus prompts to form task-specific visual consensus prompts, thereby arousing the powerful potential of pre-trained models in addressing CoSOD tasks. Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving a new state of the art (with a 6.8% improvement in the F_m metric on the most challenging CoCA dataset). Source code is available at this https URL.
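A generic parameter-efficient prompt-tuning sketch in the spirit of VCP: the foundation model is frozen and only prompt parameters are trained. The consensus-specific CPG and CPD modules are collapsed into a single learnable prompt here, so this illustrates the paradigm rather than the paper's architecture.

```python
# Frozen backbone + learnable visual prompts: only the prompts receive gradients.
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    def __init__(self, backbone, prompt_len=8, dim=768):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                     # freeze the foundation model
        self.prompts = nn.Parameter(torch.zeros(1, prompt_len, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, tokens):
        """tokens: (B, N, dim) patch embeddings from a group of related images."""
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, tokens], dim=1))
```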
https://arxiv.org/abs/2504.14254
Self-mixing interferometry (SMI) has been lauded for its sensitivity in detecting microvibrations, while requiring no physical contact with its target. Microvibrations, i.e., sounds, have recently been used as a salient indicator of extrinsic contact in robotic manipulation. In previous work, we presented a robotic fingertip using SMI for extrinsic contact sensing as an ambient-noise-resilient alternative to acoustic sensing. Here, we extend the validation experiments to the frequency domain. We find that for broadband ambient noise, SMI still outperforms acoustic sensing, but the difference is less pronounced than in time-domain analyses. For targeted noise disturbances, analogous to multiple robots simultaneously collecting data for the same task, SMI is still the clear winner. Lastly, we show how motor noise affects SMI sensing more so than acoustic sensing, and that a higher SMI readout frequency is important for future work. Design and data files are available at this https URL.
https://arxiv.org/abs/2504.13711
As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown, since existing embedding techniques often lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations guided by saliency masks, while preserving essential image features. A reverse diffusion process then ensures high-fidelity image restoration, leveraging adaptive noise levels determined by watermark strength. Our framework is theoretically grounded with stability guarantees and achieves robust watermark removal across diverse scenarios. Empirical evaluations against state-of-the-art (SOTA) watermarking techniques demonstrate SADRE's superiority in balancing watermark disruption and image quality. SADRE sets a new benchmark for watermark elimination, offering a flexible and reliable solution for real-world web content. Code is available at this https URL.
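A schematic of the saliency-guided perturbation step (not the released SADRE code): noise is scaled by a saliency mask and by an estimate of watermark strength before the reverse-diffusion restoration, which is omitted here.

```python
# Saliency-masked, strength-adaptive noise injection into latent representations.
import torch

def inject_saliency_noise(latent, saliency_mask, watermark_strength, base_sigma=0.1):
    """latent: (B, C, H, W) latents; saliency_mask in [0, 1] with matching spatial size."""
    sigma = base_sigma * (1.0 + watermark_strength)     # adaptive noise level (assumed rule)
    noise = torch.randn_like(latent)
    return latent + sigma * saliency_mask * noise       # perturb salient regions only
```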
https://arxiv.org/abs/2504.12809
Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at this https URL
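A hedged sketch of the key insight stated above: standard classifier-free guidance pushes samples toward the condition, and after a timestep threshold the condition direction is reversed to steer away from the unwanted concept. `eps_model`, the threshold, and the guidance scale are assumptions, not ANT's trained objective.

```python
# Classifier-free guidance with the condition direction reversed in late denoising stages.
def guided_eps(eps_model, x_t, t, cond, uncond, scale=7.5, reverse_after=0.5, t_max=1000):
    e_c = eps_model(x_t, t, cond)                       # conditional noise prediction
    e_u = eps_model(x_t, t, uncond)                     # unconditional noise prediction
    direction = e_c - e_u
    if t < reverse_after * t_max:                       # mid-to-late stages (t counts down)
        return e_u - scale * direction                  # push away from the unwanted concept
    return e_u + scale * direction                      # early stages: standard guidance
```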
https://arxiv.org/abs/2504.12782
Recent advances in deep learning, particularly frequency dynamic convolution (FDY conv), have significantly improved sound event detection (SED) by enabling frequency-adaptive feature extraction. However, FDY conv relies on temporal average pooling, which treats all temporal frames equally, limiting its ability to capture transient sound events such as alarm bells, door knocks, and speech plosives. To address this limitation, we propose temporal attention pooling frequency dynamic convolution (TFD conv) to replace temporal average pooling with temporal attention pooling (TAP). TAP adaptively weights temporal features through three complementary mechanisms: time attention pooling (TA) for emphasizing salient features, velocity attention pooling (VA) for capturing transient changes, and conventional average pooling for robustness to stationary signals. Ablation studies show that TFD conv improves average PSDS1 by 3.02% over FDY conv with only a 14.8% increase in parameter count. Classwise ANOVA and Tukey HSD analysis further demonstrate that TFD conv significantly enhances detection performance for transient-heavy events, outperforming existing FDY conv models. Notably, TFD conv achieves a maximum PSDS1 score of 0.456, surpassing previous state-of-the-art SED systems. We also explore the compatibility of TAP with other FDY conv variants, including dilated FDY conv (DFD conv), partial FDY conv (PFD conv), and multi-dilated FDY conv (MDFD conv). Among these, the integration of TAP with MDFD conv achieves the best result with a PSDS1 score of 0.459, validating the complementary strengths of temporal attention and multi-scale frequency adaptation. These findings establish TFD conv as a powerful and generalizable framework for enhancing both transient sensitivity and overall feature robustness in SED.
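A minimal PyTorch sketch of the three pooling branches described above (time attention, velocity attention, and plain averaging); the branch weighting and attention parameterization used in TFD conv are not reproduced exactly.

```python
# Temporal attention pooling: salient-frame, transient, and average branches combined.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionPooling(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.time_attn = nn.Conv1d(channels, channels, kernel_size=1)
        self.vel_attn = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        """x: (B, C, T) features along the time axis."""
        ta = (F.softmax(self.time_attn(x), dim=-1) * x).sum(dim=-1)   # emphasize salient frames
        vel = F.pad(x[..., 1:] - x[..., :-1], (1, 0))                 # frame-to-frame change
        va = (F.softmax(self.vel_attn(vel), dim=-1) * x).sum(dim=-1)  # emphasize transients
        avg = x.mean(dim=-1)                                          # robustness to stationary signals
        return (ta + va + avg) / 3.0
```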
https://arxiv.org/abs/2504.12670
The necessity of abundant annotated data and complex network architectures presents a significant challenge in deep-learning Salient Object Detection (deep SOD) and across the broader deep-learning landscape. This challenge is particularly acute in medical applications in developing countries with limited computational resources. Combining modern and classical techniques offers a path to maintaining competitive performance while enabling practical applications. Feature Learning from Image Markers (FLIM) methodology empowers experts to design convolutional encoders through user-drawn markers, with filters learned directly from these annotations. Recent findings demonstrate that coupling a FLIM encoder with an adaptive decoder creates a flyweight network suitable for SOD, requiring significantly fewer parameters than lightweight models and eliminating the need for backpropagation. Cellular Automata (CA) methods have proven successful in data-scarce scenarios but require proper initialization -- typically through user input, priors, or randomness. We propose a practical intersection of these approaches: using FLIM networks to initialize CA states with expert knowledge without requiring user interaction for each image. By decoding features from each level of a FLIM network, we can initialize multiple CAs simultaneously, creating a multi-level framework. Our method leverages the hierarchical knowledge encoded across different network layers, merging multiple saliency maps into a high-quality final output that functions as a CA ensemble. Benchmarks across two challenging medical datasets demonstrate the competitiveness of our multi-level CA approach compared to established models in the deep SOD literature.
https://arxiv.org/abs/2504.11406
A Large Language Model (LLM) powered GUI agent is a specialized autonomous system that performs tasks on the user's behalf according to high-level instructions. It does so by perceiving and interpreting the graphical user interfaces (GUIs) of relevant apps, often visually, inferring necessary sequences of actions, and then interacting with GUIs by executing the actions such as clicking, typing, and tapping. To complete real-world tasks, such as filling forms or booking services, GUI agents often need to process and act on sensitive user data. However, this autonomy introduces new privacy and security risks. Adversaries can inject malicious content into the GUIs that alters agent behaviors or induces unintended disclosures of private information. These attacks often exploit the discrepancy between visual saliency for agents and human users, or the agent's limited ability to detect violations of contextual integrity in task automation. In this paper, we characterized six types of such attacks, and conducted an experimental study to test these attacks with six state-of-the-art GUI agents, 234 adversarial webpages, and 39 human participants. Our findings suggest that GUI agents are highly vulnerable, particularly to contextually embedded threats. Moreover, human users are also susceptible to many of these attacks, indicating that simple human oversight may not reliably prevent failures. This misalignment highlights the need for privacy-aware agent design. We propose practical defense strategies to inform the development of safer and more reliable GUI agents.
https://arxiv.org/abs/2504.11281
Predicting personality traits automatically has become a challenging problem in computer vision. This paper introduces an innovative multimodal feature learning framework for personality analysis in short video clips. For visual processing, we construct a facial graph and design a Geo-based two-stream network incorporating an attention mechanism, leveraging both Graph Convolutional Networks (GCN) and Convolutional Neural Networks (CNN) to capture static facial expressions. Additionally, ResNet18 and VGGFace networks are employed to extract global scene and facial appearance features at the frame level. To capture dynamic temporal information, we integrate a BiGRU with a temporal attention module for extracting salient frame representations. To enhance the model's robustness, we incorporate the VGGish CNN for audio-based features and XLM-Roberta for text-based features. Finally, a multimodal channel attention mechanism is introduced to integrate different modalities, and a Multi-Layer Perceptron (MLP) regression model is used to predict personality traits. Experimental results confirm that our proposed framework surpasses existing state-of-the-art approaches in performance.
https://arxiv.org/abs/2504.11515
Salient Object Detection (SOD) with deep learning often requires substantial computational resources and large annotated datasets, making it impractical for resource-constrained applications. Lightweight models address the computational demands but typically struggle in complex scenarios with scarce labeled data. Feature Learning from Image Markers (FLIM) learns an encoder's convolutional kernels from image patches extracted from discriminative regions marked on a few representative images, dispensing with large annotated datasets, pretraining, and backpropagation. Such a methodology exploits the information redundancy commonly found in biomedical image applications. This study presents methods to learn dilated-separable convolutional kernels and multi-dilation layers without backpropagation for FLIM networks. It also proposes a novel network simplification method to reduce kernel redundancy and encoder size. By combining a FLIM encoder with an adaptive decoder, a concept recently introduced to estimate a pointwise convolution per image, this study presents very efficient (named flyweight) SOD models for biomedical images. Experimental results on challenging datasets demonstrate superior efficiency and effectiveness compared to lightweight models. By requiring significantly fewer parameters and floating-point operations, the models achieve effectiveness competitive with heavyweight models. These advances highlight the potential of FLIM networks for data-limited and resource-constrained applications with information redundancy.
https://arxiv.org/abs/2504.11112
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of solutions that are both effective and efficient at capturing the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments to cover distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (in parameter count) while 6.0 points higher (SumR) than the recent method GMMFormer on TVR.
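One possible reading of the two auxiliary losses described above, with moments represented as (center, width) pairs in [0, 1]; the margin and exact formulations are illustrative, not AMDNet's.

```python
# Illustrative moment diversity and moment relevance losses.
import torch
import torch.nn.functional as F

def moment_diversity_loss(spans, margin=0.1):
    """spans: (B, M, 2) predicted moments; penalize near-identical moment centers."""
    centers = spans[..., 0]
    dist = (centers.unsqueeze(2) - centers.unsqueeze(1)).abs()        # (B, M, M)
    mask = ~torch.eye(centers.size(1), dtype=torch.bool, device=spans.device)
    return F.relu(margin - dist[:, mask]).mean()                      # push distinct moments apart

def moment_relevance_loss(moment_emb, query_emb):
    """Encourage at least one moment per video to match its text query.
    moment_emb: (B, M, D); query_emb: (B, D)."""
    sim = F.cosine_similarity(moment_emb, query_emb.unsqueeze(1), dim=-1)   # (B, M)
    return (1.0 - sim.max(dim=1).values).mean()
```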
https://arxiv.org/abs/2504.10920