Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, which enables consistent subject generation with diverse poses and layouts. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided in this https URL.
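Illustrative note (not from the paper): the identity-transport step can be pictured as entropic optimal transport between reference and target token features, with each target token receiving a plan-weighted mix of reference identity features. The sketch below assumes hypothetical shapes and a plain L2 cost; CoDi's actual pose-aware cost and injection point are not reproduced here.

```python
import torch

def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Entropic optimal transport: returns a transport plan for a (targets x references) cost matrix."""
    cost = cost / (cost.max() + 1e-8)               # normalise the cost for numerical stability
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    a = torch.full((cost.shape[0],), 1.0 / cost.shape[0])
    b = torch.full((cost.shape[1],), 1.0 / cost.shape[1])
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def transport_identity(target_feats, ref_feats):
    """Barycentric mapping: each target token receives a plan-weighted mix of reference identity features."""
    plan = sinkhorn_plan(torch.cdist(target_feats, ref_feats))   # plain L2 cost as a stand-in for a pose-aware cost
    plan = plan / plan.sum(dim=1, keepdim=True)                  # rows become convex combination weights
    return plan @ ref_feats

# toy usage: 77 target tokens and 77 reference tokens, 64-dim features
tgt, ref = torch.randn(77, 64), torch.randn(77, 64)
blended = 0.5 * tgt + 0.5 * transport_identity(tgt, ref)         # hypothetical blending into the target branch
print(blended.shape)                                             # torch.Size([77, 64])
```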
https://arxiv.org/abs/2507.08396
Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as vision encoders for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image is of "a yellow submarine and a blue bus" or "a blue submarine and a yellow bus". Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights to solve the binding problem for CLIP are hidden in the arguably most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP's ability to learn binding using a synthetic dataset. We find that common properties of natural data such as low attribute density, incomplete captions, and the saliency bias (a tendency of human captioners to describe the object that is "most salient" to them) have a detrimental effect on binding performance. In contrast to common belief, we find that neither scaling the batch size, i.e., implicitly adding more hard negatives, nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified properties does CLIP learn almost perfect binding.
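Illustrative note (not from the paper): the three data properties can be made concrete with a toy caption sampler whose knobs control attribute density, caption completeness, and saliency-biased ordering; everything below is hypothetical.

```python
import random

COLORS = ["yellow", "blue", "red", "green"]
OBJECTS = ["submarine", "bus", "ball", "cube"]

def make_scene(n_objects=4):
    """A scene is a list of (color, object) pairs with a saliency score per object."""
    objs = random.sample(OBJECTS, n_objects)
    return [{"color": random.choice(COLORS), "name": o, "saliency": random.random()} for o in objs]

def caption(scene, attribute_density=1.0, completeness=1.0, saliency_bias=False):
    """Describe a scene while controlling the three data properties studied:
    - attribute_density: probability that an object's color is mentioned at all
    - completeness: fraction of objects that make it into the caption
    - saliency_bias: if True, mention objects in order of saliency (most salient first)"""
    objs = sorted(scene, key=lambda o: -o["saliency"]) if saliency_bias else list(scene)
    k = max(1, round(completeness * len(objs)))
    parts = []
    for o in objs[:k]:
        if random.random() < attribute_density:
            parts.append(f"a {o['color']} {o['name']}")
        else:
            parts.append(f"a {o['name']}")
    return " and ".join(parts)

scene = make_scene()
print(caption(scene, attribute_density=0.3, completeness=0.5, saliency_bias=True))   # "natural-like" caption
print(caption(scene, attribute_density=1.0, completeness=1.0, saliency_bias=False))  # fully dense and complete
```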
https://arxiv.org/abs/2507.07985
The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at this https URL.
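Illustrative note (not DTECT's code): one of the described features, trend analysis via temporally salient words, can be approximated by ranking words whose frequency in a time slice exceeds their corpus-wide frequency; the system's actual metric may differ.

```python
from collections import Counter

def temporally_salient_words(docs_by_period, top_k=5):
    """For each time period, rank words by how much their relative frequency
    in that period exceeds their relative frequency in the whole corpus."""
    all_tokens = [w for docs in docs_by_period.values() for d in docs for w in d.lower().split()]
    global_freq = Counter(all_tokens)
    total = sum(global_freq.values())
    salient = {}
    for period, docs in docs_by_period.items():
        tokens = [w for d in docs for w in d.lower().split()]
        local_freq, local_total = Counter(tokens), len(tokens)
        scores = {w: (c / local_total) - (global_freq[w] / total) for w, c in local_freq.items()}
        salient[period] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return salient

corpus = {
    "2019": ["statistical topic models for news", "latent dirichlet allocation variants"],
    "2023": ["llm driven topic labeling", "neural topic models with llm summaries"],
}
print(temporally_salient_words(corpus, top_k=3))
```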
https://arxiv.org/abs/2507.07910
Accurate position estimation is essential for modern navigation systems deployed in autonomous platforms, including ground vehicles, marine vessels, and aerial drones. In this context, Visual Simultaneous Localisation and Mapping (VSLAM), which includes Visual Odometry, relies heavily on the reliable extraction of salient feature points from the visual input data. In this work, we propose an embedded implementation of an unsupervised architecture capable of detecting and describing feature points. It is based on a quantised SuperPoint convolutional neural network. Our objective is to minimise the computational demands of the model while preserving high detection quality, thus facilitating efficient deployment on platforms with limited resources, such as mobile or embedded systems. We implemented the solution on an FPGA System-on-Chip (SoC) platform, specifically the AMD/Xilinx Zynq UltraScale+, where we evaluated the performance of Deep Learning Processing Units (DPUs) and used the Brevitas library and the FINN framework for model quantisation and hardware-aware optimisation. This allowed us to process 640 x 480 pixel images at up to 54 fps on an FPGA platform, outperforming state-of-the-art solutions in the field. We conducted experiments on the TUM dataset to demonstrate and discuss the impact of different quantisation techniques on the accuracy and performance of the model in a visual odometry task.
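Illustrative note (not the Brevitas/FINN pipeline): the core idea of low-bit weights can be sketched with generic uniform fake quantisation applied to a conv layer on a VGA-sized frame; the authors' hardware-aware flow is far more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w, n_bits=4):
    """Uniform symmetric quantisation: snap weights to a 2^n_bits-level grid, then dequantise."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

class QuantConv(nn.Module):
    """Conv layer whose weights are fake-quantised on every forward pass (quantisation-aware style)."""
    def __init__(self, in_ch, out_ch, n_bits=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.n_bits = n_bits

    def forward(self, x):
        w_q = fake_quantize(self.conv.weight, self.n_bits)
        return F.conv2d(x, w_q, self.conv.bias, padding=1)

frame = torch.randn(1, 1, 480, 640)                 # VGA-sized grayscale frame, as in the SuperPoint setting
print(QuantConv(1, 64, n_bits=4)(frame).shape)      # torch.Size([1, 64, 480, 640])
```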
https://arxiv.org/abs/2507.07903
Facial Expression Recognition (FER) systems based on deep learning have achieved impressive performance in recent years. However, these models often exhibit demographic biases, particularly with respect to age, which can compromise their fairness and reliability. In this work, we present a comprehensive study of age-related bias in deep FER models, with a particular focus on the elderly population. We first investigate whether recognition performance varies across age groups, which expressions are most affected, and whether model attention differs depending on age. Using Explainable AI (XAI) techniques, we identify systematic disparities in expression recognition and attention patterns, especially for "neutral", "sadness", and "anger" in elderly individuals. Based on these findings, we propose and evaluate three bias mitigation strategies: Multi-task Learning, Multi-modal Input, and Age-weighted Loss. Our models are trained on a large-scale dataset, AffectNet, with automatically estimated age labels and validated on balanced benchmark datasets that include underrepresented age groups. Results show consistent improvements in recognition accuracy for elderly individuals, particularly for the most error-prone expressions. Saliency heatmap analysis reveals that models trained with age-aware strategies attend to more relevant facial regions for each age group, helping to explain the observed improvements. These findings suggest that age-related bias in FER can be effectively mitigated using simple training modifications, and that even approximate demographic labels can be valuable for promoting fairness in large-scale affective computing systems.
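Illustrative note (not from the paper): of the three strategies, the age-weighted loss is the simplest to sketch, a cross-entropy in which samples from under-represented or error-prone age groups receive larger weights; the group boundaries and weights below are made up.

```python
import torch
import torch.nn.functional as F

# hypothetical weights: up-weight elderly samples, which the study found most error-prone
AGE_GROUP_WEIGHTS = {"young": 1.0, "adult": 1.0, "elderly": 2.5}

def age_weighted_ce(logits, labels, age_groups):
    """Cross-entropy where each sample is weighted by its (estimated) age group."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.tensor([AGE_GROUP_WEIGHTS[g] for g in age_groups], dtype=per_sample.dtype)
    return (weights * per_sample).sum() / weights.sum()

# toy batch: 4 samples, 7 expression classes
logits = torch.randn(4, 7)
labels = torch.tensor([0, 3, 6, 2])
loss = age_weighted_ce(logits, labels, ["adult", "elderly", "elderly", "young"])
print(loss)
```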
https://arxiv.org/abs/2507.07638
Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. To further advance the field, we introduce the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4,800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.
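Illustrative note (a simplified stand-in, not MSAGSM): the combination of multi-scale temporal dilations with a spatial gate can be sketched as parallel dilated temporal convolutions whose output is modulated by a single-head spatial attention map and added back residually.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalGate(nn.Module):
    """Mixes per-frame 2D CNN features along time with dilated temporal convolutions at
    several scales, then gates the result with a lightweight spatial attention map."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d, groups=channels)
            for d in dilations
        ])
        self.spatial_gate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)      # temporal convs at each spatial location
        mixed = sum(branch(seq) for branch in self.branches) / len(self.branches)
        mixed = mixed.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)  # back to (B, T, C, H, W)
        gate = torch.sigmoid(self.spatial_gate(mixed.reshape(b * t, c, h, w))).reshape(b, t, 1, h, w)
        return x + gate * mixed                  # residual connection keeps the module plug-and-play

out = MultiScaleTemporalGate(16)(torch.randn(2, 8, 16, 14, 14))
print(out.shape)                                 # torch.Size([2, 8, 16, 14, 14])
```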
https://arxiv.org/abs/2507.07381
Decoding visual experience from brain signals offers exciting possibilities for neuroscience and interpretable AI. While EEG is accessible and temporally precise, its limitations in spatial detail hinder image reconstruction. Our model bypasses direct EEG-to-image generation by aligning EEG signals with multilevel semantic captions -- ranging from object-level to abstract themes -- generated by a large language model. A transformer-based EEG encoder maps brain activity to these captions through contrastive learning. During inference, caption embeddings retrieved via projection heads condition a pretrained latent diffusion model for image generation. This text-mediated framework yields state-of-the-art visual decoding on the EEGCVPR dataset, with interpretable alignment to known neurocognitive pathways. Dominant EEG-caption associations reflect the importance of different semantic levels extracted from perceived images. Saliency maps and t-SNE projections reveal semantic topography across the scalp. Our model demonstrates how structured semantic mediation enables cognitively aligned visual decoding from EEG.
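Illustrative note (not from the paper): the contrastive alignment step is essentially a CLIP-style symmetric InfoNCE between EEG embeddings and caption embeddings; the dimensions and temperature below are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_align(eeg_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th EEG window should match the i-th caption embedding."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = eeg_emb @ caption_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(eeg_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy batch: 8 EEG windows and their multilevel-caption embeddings, both projected to 256-d
loss = contrastive_align(torch.randn(8, 256), torch.randn(8, 256))
print(loss)
```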
https://arxiv.org/abs/2507.07157
Image fusion aims to integrate complementary information across modalities to generate high-quality fused images, thereby enhancing the performance of high-level vision tasks. While global spatial modeling mechanisms show promising results, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs. Additionally, the absence of ground truth exacerbates the difficulty of capturing complementary features effectively. To tackle these challenges, we propose a Residual Prior-driven Frequency-aware Network, termed RPFNet. Specifically, RPFNet employs a dual-branch feature extraction framework: the Residual Prior Module (RPM) extracts modality-specific difference information from residual maps, thereby providing complementary priors for fusion; the Frequency Domain Fusion Module (FDFM) achieves efficient global feature modeling and integration through frequency-domain convolution. Additionally, the Cross Promotion Module (CPM) enhances the synergistic perception of local details and global structures through bidirectional feature interaction. During training, we incorporate an auxiliary decoder and saliency structure loss to strengthen the model's sensitivity to modality-specific differences. Furthermore, a combination of adaptive weight-based frequency contrastive loss and SSIM loss effectively constrains the solution space, facilitating the joint capture of local details and global features while ensuring the retention of complementary information. Extensive experiments validate the fusion performance of RPFNet, which effectively integrates discriminative features, enhances texture details and salient objects, and can effectively facilitate the deployment of high-level vision tasks.
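Illustrative note (not RPFNet's FDFM): frequency-domain convolution achieves global mixing by pointwise multiplication with learnable per-frequency weights after an FFT, which is what makes it cheaper than spatial long-range modeling; the sketch below shows only that core operation.

```python
import torch
import torch.nn as nn

class FourierMixer(nn.Module):
    """Global mixing in O(HW log HW): FFT, multiply by learnable per-frequency weights, inverse FFT."""
    def __init__(self, channels, height, width):
        super().__init__()
        # one complex weight per channel and rFFT frequency bin (stored as real/imag pairs)
        self.weight = nn.Parameter(torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x):                                    # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")              # (B, C, H, W//2+1), complex
        spec = spec * torch.view_as_complex(self.weight)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

fused = FourierMixer(32, 64, 64)(torch.randn(2, 32, 64, 64))
print(fused.shape)                                           # torch.Size([2, 32, 64, 64])
```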
https://arxiv.org/abs/2507.06735
Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.
https://arxiv.org/abs/2507.06078
Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, leading to layouts that lack structure and diversity. To address this issue, we introduce ReLayout, a novel method that leverages relation-CoT to generate more reasonable and aesthetically coherent layouts by grounding generation in design concepts. Specifically, we enhance layout annotations by introducing explicit relation definitions, such as region, salient, and margin between elements, with the goal of decomposing the layout into smaller, structured, and recursive layouts, thereby enabling the generation of more structured layouts. Furthermore, based on these defined relationships, we introduce a layout prototype rebalance sampler, which defines layout prototype features across three dimensions and quantifies distinct layout styles. This sampler rebalances the prototype distribution to address uniformity issues in generation that arise from data bias. Extensive experimental results verify that ReLayout outperforms baselines and can generate structured and diverse layouts that are more aligned with human aesthetics and more explainable.
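Illustrative note (not from the paper): the explicit relation annotations (region, saliency/overlap, margins) can be derived from element bounding boxes roughly as below; the field names and canvas discretisation are hypothetical.

```python
def box_relations(a, b, canvas_w=1000, canvas_h=1000):
    """Explicit relations between two layout elements given as (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b

    # region: which third of the canvas each element's centre falls into (a coarse position label)
    def region(x0, y0, x1, y1):
        cx, cy = (x0 + x1) / 2 / canvas_w, (y0 + y1) / 2 / canvas_h
        col = ["left", "center", "right"][min(int(cx * 3), 2)]
        row = ["top", "middle", "bottom"][min(int(cy * 3), 2)]
        return f"{row}-{col}"

    # margin: horizontal and vertical gaps between the two boxes (negative means overlap on that axis)
    h_gap = max(bx0 - ax1, ax0 - bx1)
    v_gap = max(by0 - ay1, ay0 - by1)
    return {"region_a": region(*a), "region_b": region(*b),
            "h_margin": h_gap, "v_margin": v_gap,
            "overlap": h_gap < 0 and v_gap < 0}

title = (100, 50, 900, 150)
logo = (820, 40, 980, 120)
print(box_relations(title, logo))
```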
https://arxiv.org/abs/2507.05568
Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework based on causal intervention and counterfactual reasoning, which utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at this https URL.
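Illustrative note (not the paper's implementation): the textual causal intervention rests on backdoor adjustment, P(Y | do(Q)) ≈ Σ_z P(Y | Q, z) P(z), i.e., scoring against a fixed confounder dictionary weighted by its prior rather than by its co-occurrence with the query; all module names below are assumptions.

```python
import torch
import torch.nn as nn

class BackdoorAdjustedScorer(nn.Module):
    """Approximates P(Y | do(query)) by marginalising over a fixed confounder dictionary
    (e.g., prototype embeddings of frequent verbs/phrases) instead of conditioning on it."""
    def __init__(self, dim, num_confounders=32):
        super().__init__()
        self.confounders = nn.Parameter(torch.randn(num_confounders, dim) * 0.02)  # learned z dictionary
        self.prior = nn.Parameter(torch.zeros(num_confounders))                    # logits of P(z)
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query_emb):                                  # query_emb: (B, dim)
        p_z = torch.softmax(self.prior, dim=0)                     # (Z,)
        b, z = query_emb.size(0), self.confounders.size(0)
        pairs = torch.cat([query_emb.unsqueeze(1).expand(b, z, -1),
                           self.confounders.unsqueeze(0).expand(b, z, -1)], dim=-1)
        scores = self.score(pairs).squeeze(-1)                     # (B, Z): score given each confounder z
        return (scores * p_z).sum(dim=-1)                          # weighted by P(z), not P(z | query)

print(BackdoorAdjustedScorer(dim=64)(torch.randn(4, 64)).shape)    # torch.Size([4])
```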
https://arxiv.org/abs/2507.04958
We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLM) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones. This benchmark focuses on low-level features including color, intensity, and orientation, which are fundamental to human visual processing. Our SalBench consists of images that highlight rare, unusual, or unexpected elements within scenes, and naturally draw human attention. It comprises three novel tasks for evaluating the perceptual capabilities of LVLM: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. We perform a comprehensive evaluation of state-of-the-art LVLM using SalBench and our findings reveal a surprising limitation: LVLM struggle to identify seemingly obvious visual anomalies, with even the advanced GPT-4o achieving only 47.6% accuracy on such a simple task. SalBench will be an important step in measuring the capabilities of LVLM that align with the subtle definition of human attention.
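Illustrative note (not the benchmark's generator): the kind of pop-out stimulus described, a grid of small circles with one larger target, can be synthesised in a few lines.

```python
import numpy as np

def odd_one_out_image(grid=5, cell=40, small_r=8, large_r=16, seed=0):
    """Grid of small circles with one randomly placed larger circle (the 'odd one out')."""
    rng = np.random.default_rng(seed)
    size = grid * cell
    img = np.zeros((size, size), dtype=np.uint8)
    odd = (int(rng.integers(grid)), int(rng.integers(grid)))
    yy, xx = np.mgrid[0:size, 0:size]
    for i in range(grid):
        for j in range(grid):
            cy, cx = i * cell + cell // 2, j * cell + cell // 2
            r = large_r if (i, j) == odd else small_r
            img[(yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2] = 255
    return img, odd

img, odd_cell = odd_one_out_image()
print(img.shape, "odd cell (row, col):", odd_cell)   # (200, 200), plus the target location
```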
https://arxiv.org/abs/2507.04741
Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.
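Illustrative note (a simplified stand-in, not the paper's architecture): the cross-modal graph idea, CT-slice and question-token embeddings as one node set with a similarity-based adjacency and an attentive propagation step, pooled into a soft prompt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphFusion(nn.Module):
    """Treat CT slices and question tokens as one node set, build a soft adjacency from
    pairwise similarity, and propagate features with a single attention-weighted update."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, slice_emb, token_emb):                         # (S, dim), (T, dim)
        nodes = torch.cat([slice_emb, token_emb], dim=0)             # (S+T, dim)
        sim = F.normalize(nodes, dim=-1) @ F.normalize(nodes, dim=-1).t()
        adj = torch.softmax(sim / 0.1, dim=-1)                       # attention-style row-normalised adjacency
        fused = F.relu(self.proj(adj @ nodes))                       # one graph-convolution-like step
        return fused.mean(dim=0)                                     # pooled graph feature, used as a soft prompt

soft_prompt = CrossModalGraphFusion(256)(torch.randn(32, 256), torch.randn(20, 256))
print(soft_prompt.shape)   # torch.Size([256])
```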
https://arxiv.org/abs/2507.04333
Diffusion models have demonstrated remarkable performance on vision generation tasks. However, their high computational complexity hinders wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization; directly applying them causes severe performance degradation. We identify that the existing quantization framework suffers from an outlier-unfriendly quantizer design and from suboptimal initialization and optimization strategies. We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models. From the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for a uniform quantizer. We propose Flexible Z-Order Residual Mixed Quantization, which utilizes an efficient binary residual branch for flexible quant steps to handle salient error. For the optimization framework, we theoretically analyze the convergence and optimality of the LoRA module and propose Object-Oriented Low-Rank Initialization to use prior quantization error for informative initialization. We then propose Memory-based Temporal Relation Distillation to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures overall temporal consistency between the quantized and full-precision models. Comprehensive experiments on various generation tasks show that MPQ-DMv2 surpasses current SOTA methods by a large margin across different architectures, especially under extremely low bit widths.
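Illustrative note (a generic sketch, not Flexible Z-Order Residual Mixed Quantization): the value of a binary residual branch on top of a uniform quantizer can be seen on toy weights with injected outliers, where the extra branch reduces the reconstruction error.

```python
import torch

def uniform_quant(w, n_bits=3):
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def residual_binary_quant(w, n_bits=3):
    """Base low-bit quantisation plus a 1-bit residual branch: the sign of the residual
    scaled by its mean magnitude. The extra branch mostly helps where outliers cause large error."""
    base = uniform_quant(w, n_bits)
    residual = w - base
    alpha = residual.abs().mean()                  # single scale for the binary residual branch
    return base + alpha * torch.sign(residual)

w = torch.randn(4096)
w[::64] *= 10.0                                    # inject sparse outliers, mimicking salient weights
for fn in (uniform_quant, residual_binary_quant):
    print(fn.__name__, float((w - fn(w)).pow(2).mean()))   # the residual branch lowers the error
```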
https://arxiv.org/abs/2507.04290
Most GCN-based methods model interacting individuals as independent graphs, neglecting their inherent inter-dependencies. Although recent approaches utilize predefined interaction adjacency matrices to integrate participants, these matrices fail to adaptively capture the dynamic and context-specific joint interactions across different actions. In this paper, we propose the Active Node Selection with External Attention Network (ASEA), an innovative approach that dynamically captures interaction relationships without predefined assumptions. Our method models each participant individually using a GCN to capture intra-personal relationships, facilitating a detailed representation of their actions. To identify the most relevant nodes for interaction modeling, we introduce the Adaptive Temporal Node Amplitude Calculation (AT-NAC) module, which estimates global node activity by combining spatial motion magnitude with adaptive temporal weighting, thereby highlighting salient motion patterns while reducing irrelevant or redundant information. A learnable threshold, regularized to prevent extreme variations, is defined to selectively identify the most informative nodes for interaction modeling. To capture interactions, we design the External Attention (EA) module to operate on active nodes, effectively modeling the interaction dynamics and semantic relationships between individuals. Extensive evaluations show that our method captures interaction relationships more effectively and flexibly, achieving state-of-the-art performance.
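Illustrative note (simplified, not AT-NAC as published): global node activity can be approximated as temporally weighted frame-to-frame joint motion, followed by selection of the most active joints; the paper uses a learnable, regularized threshold rather than the fixed top-k below.

```python
import torch

def select_active_nodes(joints, temporal_weights, k=8):
    """joints: (T, V, C) joint coordinates over time; temporal_weights: (T-1,) scores that
    emphasise informative frames. Returns the indices of the k most active joints."""
    motion = (joints[1:] - joints[:-1]).norm(dim=-1)           # (T-1, V) frame-to-frame displacement
    w = torch.softmax(temporal_weights, dim=0)                 # adaptive temporal weighting
    activity = (w[:, None] * motion).sum(dim=0)                # (V,) global node activity
    return activity.topk(k).indices, activity

T, V = 64, 25                                                  # e.g., 25 skeleton joints
joints = torch.randn(T, V, 3).cumsum(dim=0) * 0.05             # smooth-ish toy trajectories
active_idx, activity = select_active_nodes(joints, torch.zeros(T - 1), k=8)
print(active_idx)                                              # joints passed to the external-attention stage
```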
https://arxiv.org/abs/2507.03936
Out-of-Distribution (OOD) detection is critical for safely deploying deep models in open-world environments, where inputs may lie outside the training distribution. During inference on a model trained exclusively with In-Distribution (ID) data, we observe a salient gradient phenomenon: around an ID sample, the local gradient directions for "enhancing" that sample's predicted class remain relatively consistent, whereas OOD samples--unseen in training--exhibit disorganized or conflicting gradient directions in the same neighborhood. Motivated by this observation, we propose an inference-stage technique to short-circuit those feature coordinates that spurious gradients exploit to inflate OOD confidence, while leaving ID classification largely intact. To circumvent the expense of recomputing the logits after this gradient short-circuit, we further introduce a local first-order approximation that accurately captures the post-modification outputs without a second forward pass. Experiments on standard OOD benchmarks show our approach yields substantial improvements. Moreover, the method is lightweight and requires minimal changes to the standard inference pipeline, offering a practical path toward robust OOD detection in real-world applications.
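Illustrative note (not the paper's code): the local first-order approximation estimates the logits after zeroing a set of feature coordinates from the gradient of the logits with respect to the features, avoiding a second forward pass; with a linear head, as in the toy below, the estimate is exact.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
head = nn.Linear(512, 10)                      # classification head on top of penultimate features
feat = torch.randn(512, requires_grad=True)
logits = head(feat)

# suppose an OOD criterion flagged these feature coordinates as exploited by spurious gradients
masked = torch.tensor([3, 17, 42, 101])

# first-order estimate of the logits after zeroing those coordinates:
#   logits_new ≈ logits - J[:, masked] @ feat[masked],  with J = d(logits)/d(feat)
jacobian = torch.stack([torch.autograd.grad(logits[c], feat, retain_graph=True)[0]
                        for c in range(logits.numel())])            # (10, 512)
approx = logits.detach() - jacobian[:, masked] @ feat.detach()[masked]

# reference: actually zero the coordinates and re-run the head (exact here because the head is linear)
feat_zeroed = feat.detach().clone()
feat_zeroed[masked] = 0.0
print(torch.allclose(approx, head(feat_zeroed), atol=1e-5))          # True
```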
https://arxiv.org/abs/2507.01417
In modern electronic manufacturing, defect detection on Printed Circuit Boards (PCBs) plays a critical role in ensuring product yield and maintaining the reliability of downstream assembly processes. However, existing methods often suffer from limited feature representation, computational redundancy, and insufficient availability of high-quality training data -- challenges that hinder their ability to meet industrial demands for both accuracy and efficiency. To address these limitations, we propose MRC-DETR, a novel and efficient detection framework tailored for bare PCB defect inspection, built upon the foundation of RT-DETR. Firstly, to enhance feature representation capability, we design a Multi-Residual Directional Coupled Block (MRDCB). This module improves channel-wise feature interaction through a multi-residual structure. Moreover, a cross-spatial learning strategy is integrated to capture fine-grained pixel-level relationships, further enriching the representational power of the extracted features. Secondly, to reduce computational redundancy caused by inefficient cross-layer information fusion, we introduce an Adaptive Screening Pyramid Network (ASPN). This component dynamically filters and aggregates salient low-level features, selectively fusing them with high-level semantic features. By focusing on informative regions and suppressing redundant computations, ASPN significantly improves both efficiency and detection accuracy. Finally, to tackle the issue of insufficient training data, particularly in the context of bare PCBs, we construct a new, high-quality dataset that fills a critical gap in current public resources. Our dataset not only supports the training and evaluation of our proposed framework but also serves as a valuable benchmark for future research in this domain.
https://arxiv.org/abs/2507.03386
As large language models (LLMs) grow in size, efficient compression techniques like quantization and sparsification are critical. While quantization maintains performance with reduced precision, structured sparsity methods such as N:M sparsification often fall short due to limited flexibility and sensitivity to outlier weights. We explore 8:16 semi-structured sparsity, demonstrating its ability to surpass the Performance Threshold, where a compressed model matches the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints. Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal storage overhead (0.875 vs. 0.75 bits/element). We also apply sparse structured patterns for salient weights, showing that structured sparsity for outliers is competitive with unstructured approaches, leading to equivalent or better results. Finally, we demonstrate that simple techniques such as variance correction and SmoothQuant-like weight equalization improve the performance of sparse models.
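Illustrative note (not the paper's kernels): applying an 8:16 pattern means keeping the 8 largest-magnitude weights in every contiguous group of 16; the quoted 0.875 bits/element presumably corresponds to the ~14 bits needed to encode which 8 of 16 positions survive in each group.

```python
import torch

def nm_sparsify(weights, n=8, m=16):
    """Keep the n largest-magnitude weights in every contiguous group of m (here 8:16)."""
    w = weights.reshape(-1, m)                                  # group the weights in blocks of m
    idx = w.abs().topk(n, dim=1).indices                        # per-group positions of the kept weights
    mask = torch.zeros_like(w).scatter_(1, idx, 1.0).bool()
    sparse = torch.where(mask, w, torch.zeros_like(w))
    return sparse.reshape(weights.shape), mask

w = torch.randn(4, 64)                                          # toy weight matrix (64 divisible by 16)
sparse_w, mask = nm_sparsify(w)
print(mask.float().mean())                                      # tensor(0.5000): half the weights survive
```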
https://arxiv.org/abs/2507.03052
Grey matter loss in the hippocampus is a hallmark of neurobiological aging, yet understanding the corresponding changes in its functional connectivity remains limited. Seed-based functional connectivity (FC) analysis enables voxel-wise mapping of the hippocampus's synchronous activity with cortical regions, offering a window into functional reorganization during aging. In this study, we develop an interpretable deep learning framework to predict brain age from hippocampal FC using a three-dimensional convolutional neural network (3D CNN) combined with LayerCAM saliency mapping. This approach maps key hippocampal-cortical connections, particularly with the precuneus, cuneus, posterior cingulate cortex, parahippocampal cortex, left superior parietal lobule, and right superior temporal sulcus, that are highly sensitive to age. Critically, disaggregating anterior and posterior hippocampal FC reveals distinct mapping aligned with their known functional specializations. These findings provide new insights into the functional mechanisms of hippocampal aging and demonstrate the power of explainable deep learning to uncover biologically meaningful patterns in neuroimaging data.
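Illustrative note (not the study's preprocessing): seed-based FC reduces to correlating the mean seed time series with every voxel's time series, producing the voxel-wise map that the 3D CNN consumes.

```python
import numpy as np

def seed_based_fc(bold, seed_mask):
    """bold: (T, X, Y, Z) fMRI time series; seed_mask: boolean (X, Y, Z) hippocampal seed.
    Returns an (X, Y, Z) map of Pearson correlations with the mean seed time series."""
    seed_ts = bold[:, seed_mask].mean(axis=1)                  # (T,) average signal inside the seed
    voxels = bold.reshape(bold.shape[0], -1)                   # (T, n_voxels)
    voxels = (voxels - voxels.mean(0)) / (voxels.std(0) + 1e-8)
    seed_z = (seed_ts - seed_ts.mean()) / (seed_ts.std() + 1e-8)
    fc = (voxels * seed_z[:, None]).mean(axis=0)               # correlation of each voxel with the seed
    return fc.reshape(bold.shape[1:])

T, vol_shape = 200, (16, 16, 8)
bold = np.random.randn(T, *vol_shape)
seed = np.zeros(vol_shape, dtype=bool)
seed[6:9, 6:9, 3:5] = True                                     # toy seed region
print(seed_based_fc(bold, seed).shape)                         # (16, 16, 8)
```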
https://arxiv.org/abs/2507.01411
Foreign accent conversion (FAC) in speech processing remains a challenging task. Building on the remarkable success of large language models (LLMs) in Text-to-Speech (TTS) tasks, this study investigates the adaptation of LLM-based techniques for FAC, which we term SpeechAccentLLM. At the core of this framework, we introduce SpeechCodeVAE, the first model to integrate connectionist temporal classification (CTC) directly into codebook discretization for speech content tokenization. This novel architecture generates tokens with a unique "locality" property, as validated by experiments demonstrating optimal trade-offs among content faithfulness, temporal coherence, and structural recoverability. Then, to address data scarcity for the FAC module, we adopted a multitask learning strategy that jointly trains the FAC and TTS modules. Beyond mitigating data limitations, this approach yielded accelerated convergence and superior speech quality compared to standalone FAC training. Moreover, leveraging the salient properties of our discrete speech representations, we introduce SpeechRestorer, a postprocessing architecture designed to refine LLM-generated outputs. This module effectively mitigates stochastic errors prevalent in LLM inference pipelines while enhancing prosodic continuity, as validated by ablation experiments.
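Illustrative note (a generic sketch, not SpeechCodeVAE): codebook discretisation maps each frame-level speech feature to its nearest codebook entry, yielding discrete content tokens; the CTC integration that gives the tokens their "locality" property is not reproduced here.

```python
import torch
import torch.nn as nn

class Codebook(nn.Module):
    """Nearest-neighbour vector quantiser: maps each frame feature to a discrete token id
    and to the corresponding codebook embedding (straight-through in real training)."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim) * 0.02)

    def forward(self, frames):                               # frames: (T, dim)
        dist = torch.cdist(frames, self.codes)               # (T, num_codes)
        ids = dist.argmin(dim=-1)                            # discrete speech-content tokens
        return ids, self.codes[ids]

frames = torch.randn(120, 256)                               # ~1.2 s of 10 ms frame features
ids, quantised = Codebook()(frames)
print(ids[:10], quantised.shape)
```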
https://arxiv.org/abs/2507.01348