We present EditIQ, a completely automated framework for cinematically editing scenes captured via a stationary, large field-of-view and high-resolution camera. From the static camera feed, EditIQ initially generates multiple virtual feeds, emulating a team of cameramen. These virtual camera shots, termed rushes, are subsequently assembled using an automated editing algorithm whose objective is to present the viewer with the most vivid scene content. To understand key scene elements and guide the editing process, we employ a two-pronged approach: (1) a large language model (LLM)-based dialogue understanding module to analyze conversational flow, coupled with (2) visual saliency prediction to identify meaningful scene elements and camera shots therefrom. We then formulate cinematic video editing as an energy minimization problem over shot selection, where cinematic constraints determine shot choices, transitions, and continuity. EditIQ synthesizes an aesthetically and visually compelling representation of the original narrative while maintaining cinematic coherence and a smooth viewing experience. The efficacy of EditIQ against competing baselines is demonstrated via a psychophysical study involving twenty participants on the BBC Old School dataset and eleven theatre performance videos. Video samples from EditIQ can be found at this https URL.
https://arxiv.org/abs/2502.02172
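The shot-selection energy described above lends itself to a compact dynamic-programming illustration. The sketch below is a minimal stand-in, assuming a per-frame unary cost (e.g. negative saliency of each rush) and a fixed shot-switch penalty; EditIQ's actual energy terms, constraints, and solver are richer than this.

import numpy as np

def select_shots(unary_cost, switch_cost=1.0):
    """Pick one rush per frame by minimizing
    sum_t unary_cost[t, s_t] + switch_cost * [s_t != s_{t-1}]
    with dynamic programming; a simplified stand-in for the paper's energy,
    which also enforces minimum shot durations and rhythm constraints."""
    T, S = unary_cost.shape
    dp = np.full((T, S), np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0] = unary_cost[0]
    for t in range(1, T):
        best_prev = int(dp[t - 1].argmin())
        for s in range(S):
            stay = dp[t - 1, s]
            jump = dp[t - 1, best_prev] + switch_cost
            if stay <= jump:
                dp[t, s], back[t, s] = stay + unary_cost[t, s], s
            else:
                dp[t, s], back[t, s] = jump + unary_cost[t, s], best_prev
    # Backtrack the optimal shot sequence from the last frame.
    seq = [int(dp[-1].argmin())]
    for t in range(T - 1, 0, -1):
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1]

# Toy usage: 100 frames, 4 rushes, unary cost = negative visual saliency of each rush.
rng = np.random.default_rng(0)
print(select_shots(-rng.random((100, 4)))[:10])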
Navigating densely vegetated environments poses significant challenges for autonomous ground vehicles. Learning-based systems typically use prior and in-situ data to predict terrain traversability but often degrade in performance when encountering out-of-distribution elements caused by rapid environmental changes or novel conditions. This paper presents a novel, lidar-only, online adaptive traversability estimation (TE) method that trains a model directly on the robot using self-supervised data collected through robot-environment interaction. The proposed approach utilises a probabilistic 3D voxel representation to integrate lidar measurements and robot experience, creating a salient environmental model. To ensure computational efficiency, a sparse graph-based representation is employed to update temporally evolving voxel distributions. Extensive experiments with an unmanned ground vehicle in natural terrain demonstrate that the system adapts to complex environments with as little as 8 minutes of operational data, achieving a Matthews Correlation Coefficient (MCC) score of 0.63 and enabling safe navigation in densely vegetated environments. This work examines different training strategies for voxel-based TE methods and offers recommendations to improve adaptability. The proposed method is validated on a robotic platform with limited computational resources (25W GPU), achieving accuracy comparable to offline-trained models while maintaining reliable performance across varied environments.
https://arxiv.org/abs/2502.01987
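A minimal sketch of the probabilistic voxel bookkeeping described above, assuming a Beta-Bernoulli count per voxel that is updated from robot-environment interaction; the voxel size, hashing, and update rule are illustrative assumptions, not the paper's exact model.

from collections import defaultdict

VOXEL = 0.2  # assumed voxel edge length in metres

def key(p):
    return tuple(int(c // VOXEL) for c in p)

class TraversabilityMap:
    """Sparse voxel map holding (traversed, blocked) evidence counts per voxel."""
    def __init__(self):
        self.counts = defaultdict(lambda: [1, 1])  # Beta(1, 1) prior

    def update(self, point, traversed):
        a_b = self.counts[key(point)]
        a_b[0 if traversed else 1] += 1

    def p_traversable(self, point):
        a, b = self.counts[key(point)]
        return a / (a + b)

m = TraversabilityMap()
m.update((1.03, 0.42, 0.10), traversed=True)   # robot drove through this voxel
m.update((1.03, 0.42, 0.10), traversed=True)
m.update((3.50, 0.40, 0.25), traversed=False)  # dense vegetation the robot avoided
print(m.p_traversable((1.03, 0.42, 0.10)), m.p_traversable((3.50, 0.40, 0.25)))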
While attention-based approaches have shown considerable progress in enhancing image fusion and addressing the challenges posed by long-range feature dependencies, their efficacy in capturing local features is compromised by the lack of diverse receptive field extraction techniques. To overcome the shortcomings of existing fusion methods in extracting multi-scale local features and preserving global features, this paper proposes a novel cross-modal image fusion approach based on a multi-scale convolutional neural network with attention Transformer (MATCNN). MATCNN utilizes the multi-scale fusion module (MSFM) to extract local features at different scales and employs the global feature extraction module (GFEM) to extract global features. Combining the two reduces the loss of detail features and improves the ability of global feature representation. Simultaneously, an information mask is used to label pertinent details within the images, aiming to enhance the proportion of preserving significant information in infrared images and background textures in visible images in fused images. Subsequently, a novel optimization algorithm is developed, leveraging the mask to guide feature extraction through the integration of content, structural similarity index measurement, and global feature loss. Quantitative and qualitative evaluations are conducted across various datasets, revealing that MATCNN effectively highlights infrared salient targets, preserves additional details in visible images, and achieves better fusion results for cross-modal images. The code of MATCNN will be available at this https URL.
https://arxiv.org/abs/2502.01959
In this work, we present INTACT, a novel two-phase framework designed to enhance the robustness of deep neural networks (DNNs) against noisy LiDAR data in safety-critical perception tasks. INTACT combines meta-learning with adversarial curriculum training (ACT) to systematically address challenges posed by data corruption and sparsity in 3D point clouds. The meta-learning phase equips a teacher network with task-agnostic priors, enabling it to generate robust saliency maps that identify critical data regions. The ACT phase leverages these saliency maps to progressively expose a student network to increasingly complex noise patterns, ensuring targeted perturbation and improved noise resilience. INTACT's effectiveness is demonstrated through comprehensive evaluations on object detection, tracking, and classification benchmarks using diverse datasets, including KITTI, Argoverse, and ModelNet40. Results indicate that INTACT improves model robustness by up to 20% across all tasks, outperforming standard adversarial and curriculum training methods. This framework not only addresses the limitations of conventional training strategies but also offers a scalable and efficient solution for real-world deployment in resource-constrained safety-critical systems. INTACT's principled integration of meta-learning and adversarial training establishes a new paradigm for noise-tolerant 3D perception in safety-critical applications. INTACT improved KITTI Multiple Object Tracking Accuracy (MOTA) by 9.6% (64.1% -> 75.1%) and by 12.4% under Gaussian noise (52.5% -> 73.7%). Similarly, KITTI mean Average Precision (mAP) rose from 59.8% to 69.8% under 50% point drop and from 49.3% to 70.9% under Gaussian noise, highlighting the framework's ability to enhance deep learning model resilience in safety-critical object tracking scenarios.
https://arxiv.org/abs/2502.01896
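The saliency-guided adversarial curriculum can be pictured as perturbing the points that a teacher's saliency map marks as critical, with severity that grows over training. The sketch below is a hedged illustration; the noise model, drop rule, and schedule are assumptions rather than INTACT's actual ACT phase.

import numpy as np

def curriculum_perturb(points, saliency, epoch, max_epochs,
                       top_frac=0.2, max_sigma=0.05, max_drop=0.3):
    """Perturb the most salient LiDAR points with jitter and dropout whose
    severity grows linearly with the training epoch (a simple curriculum)."""
    severity = (epoch + 1) / max_epochs
    n = len(points)
    k = int(top_frac * n)
    idx = np.argsort(saliency)[-k:]                    # most critical points
    noisy = points.copy()
    noisy[idx] += np.random.normal(0.0, max_sigma * severity, (k, 3))
    drop_prob = max_drop * severity * np.isin(np.arange(n), idx)
    keep = np.random.rand(n) > drop_prob               # only salient points may be dropped
    return noisy[keep]

pts = np.random.rand(1000, 3)
sal = np.random.rand(1000)            # stand-in for the teacher's saliency map
student_input = curriculum_perturb(pts, sal, epoch=3, max_epochs=10)
print(student_input.shape)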
In industrial settings, weakly supervised (WS) methods are usually preferred over their fully supervised (FS) counterparts as they do not require costly manual annotations. Unfortunately, the segmentation masks obtained in the WS regime are typically poor in terms of accuracy. In this work, we present a WS method capable of producing accurate masks for semantic segmentation in the case of video streams. More specifically, we build saliency maps that exploit the temporal coherence between consecutive frames in a video, promoting consistency when objects appear in different frames. We apply our method in a waste-sorting scenario, where we perform weakly supervised video segmentation (WSVS) by training an auxiliary classifier that distinguishes between videos recorded before and after a human operator manually removes specific wastes from a conveyor belt. The saliency maps of this classifier identify materials to be removed, and we modify the classifier training to minimize differences between the saliency map of a central frame and those of adjacent frames, after compensating for object displacement. Experiments on a real-world dataset demonstrate the benefits of integrating temporal coherence directly during the training phase of the classifier. Code and dataset are available upon request.
https://arxiv.org/abs/2502.01455
This thesis explores advanced approaches to improve explainability in computer vision by analyzing and modeling the features exploited by deep neural networks. Initially, it evaluates attribution methods, notably saliency maps, by introducing a metric based on algorithmic stability and an approach utilizing Sobol indices, which, through quasi-Monte Carlo sequences, allows a significant reduction in computation time. In addition, the EVA method offers a first formulation of attribution with formal guarantees via verified perturbation analysis. Experimental results indicate that in complex scenarios these methods do not provide sufficient understanding, particularly because they identify only "where" the model focuses without clarifying "what" it perceives. Two hypotheses are therefore examined: aligning models with human reasoning -- through the introduction of a training routine that integrates the imitation of human explanations and optimization within the space of 1-Lipschitz functions -- and adopting a conceptual explainability approach. The CRAFT method is proposed to automate the extraction of the concepts used by the model and to assess their importance, complemented by MACO, which enables their visualization. These works converge towards a unified framework, illustrated by an interactive demonstration applied to the 1000 ImageNet classes in a ResNet model.
https://arxiv.org/abs/2502.01048
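The Sobol-index attribution mentioned above can be illustrated with a small quasi-Monte Carlo estimator: mask image regions with a Sobol design and estimate each region's total-order index on the model score via the Jansen estimator. The grid size, sample count, and masking scheme below are assumptions for illustration, not the thesis's exact procedure.

import numpy as np
from scipy.stats import qmc

def sobol_attribution(model_score, image, grid=8, n=256):
    """Estimate total-order Sobol indices of grid x grid image regions
    w.r.t. a scalar model score, using a quasi-Monte Carlo (Sobol) design
    and the Jansen estimator."""
    d = grid * grid
    sampler = qmc.Sobol(d=2 * d, scramble=True)
    ab = sampler.random(n)
    A, B = ab[:, :d], ab[:, d:]

    def score(masks):
        H, W = image.shape[:2]
        out = []
        for m in masks:
            up = np.kron(m.reshape(grid, grid), np.ones((H // grid, W // grid)))
            out.append(model_score(image * up[..., None]))
        return np.asarray(out)

    fA = score(A)
    total = np.empty(d)
    for i in range(d):
        ABi = A.copy()
        ABi[:, i] = B[:, i]                           # swap one region's mask values
        total[i] = np.mean((fA - score(ABi)) ** 2) / (2 * np.var(fA) + 1e-12)
    return total.reshape(grid, grid)

# Toy usage with a stand-in "model": mean brightness of the top-left corner.
img = np.random.rand(64, 64, 3)
sal = sobol_attribution(lambda x: x[:16, :16].mean(), img)
print(sal.shape)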
This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, by averaging predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000fps.
https://arxiv.org/abs/2502.00397
At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure slows down the attention computation due to its row-wise nature, and inherently introduces competition among tokens: as the weight assigned to one token increases, the weights of others decrease. This competitive dynamic may narrow the focus of self-attention to a limited set of features, potentially overlooking other informative characteristics. Recent experimental studies have shown that using the element-wise sigmoid function helps eliminate token competition and reduce the computational overhead. Despite these promising empirical results, a rigorous comparison between sigmoid and softmax self-attention mechanisms remains absent in the literature. This paper closes this gap by theoretically demonstrating that sigmoid self-attention is more sample-efficient than its softmax counterpart. Toward that goal, we illustrate that each row of the self-attention matrix can be represented as a mixture of experts. Our analysis shows that ''experts'' in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention. We corroborate our theoretical findings through extensive experiments on both synthetic and real-world datasets.
https://arxiv.org/abs/2502.00281
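The contrast between the two mechanisms is easy to see in code: softmax normalizes each row of scores so tokens compete for a fixed attention budget, while element-wise sigmoid weights each token independently. The sequence-length scaling in the sigmoid variant is an assumption for numerical stability, not a claim about the paper's exact parameterization.

import numpy as np

def softmax_attention(Q, K, V):
    """Row-wise softmax attention: weights on each row sum to 1,
    so tokens compete for attention mass."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def sigmoid_attention(Q, K, V):
    """Element-wise sigmoid attention: each weight is computed independently,
    removing row-wise normalization and hence the competition between tokens."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = 1.0 / (1.0 + np.exp(-scores))
    return (w @ V) / K.shape[0]   # scale by sequence length (an assumption)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, sigmoid_attention(Q, K, V).shape)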
Concept-based explanation methods, such as concept bottleneck models (CBMs), aim to improve the interpretability of machine learning models by linking their decisions to human-understandable concepts, under the critical assumption that such concepts can be accurately attributed to the network's feature space. However, this foundational assumption has not been rigorously validated, mainly because the field lacks standardised metrics and benchmarks to assess the existence and spatial alignment of such concepts. To address this, we propose three metrics: the concept global importance metric, the concept existence metric, and the concept location metric, including a technique for visualising concept activations, i.e., concept activation mapping. We benchmark post-hoc CBMs to illustrate their capabilities and challenges. Through qualitative and quantitative experiments, we demonstrate that, in many cases, even the most important concepts determined by post-hoc CBMs are not present in input images; moreover, when they are present, their saliency maps fail to align with the expected regions by either activating across an entire object or misidentifying relevant concept-specific regions. We analyse the root causes of these limitations, such as the natural correlation of concepts. Our findings underscore the need for more careful application of concept-based explanation techniques especially in settings where spatial interpretability is critical.
https://arxiv.org/abs/2501.19271
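Concept activation mapping, as named above, can be sketched as projecting a spatial feature map onto a concept direction; the chosen layer, the ReLU, and the normalization below are illustrative assumptions rather than the paper's exact definition.

import numpy as np

def concept_activation_map(features, concept_vector):
    """features: (C, H, W) activations of some layer; concept_vector: (C,)
    direction representing a concept (e.g. from a post-hoc CBM).
    Returns an (H, W) map of how strongly each location expresses the concept."""
    cmap = np.tensordot(concept_vector, features, axes=([0], [0]))  # (H, W)
    cmap = np.maximum(cmap, 0.0)                                    # keep positive evidence
    return cmap / (cmap.max() + 1e-8)

feats = np.random.rand(512, 7, 7)        # stand-in for a CNN feature map
concept = np.random.rand(512)            # stand-in for a learned concept direction
print(concept_activation_map(feats, concept).shape)   # (7, 7)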
With over 2 million new cases identified annually, skin cancer is the most prevalent type of cancer globally and the second most common in Bangladesh, following breast cancer. Early detection and treatment are crucial for enhancing patient outcomes; however, Bangladesh faces a shortage of dermatologists and qualified medical professionals capable of diagnosing and treating skin cancer. As a result, many cases are diagnosed only at advanced stages. Research indicates that deep learning algorithms can effectively classify skin cancer images. However, these models typically lack interpretability, making it challenging to understand their decision-making processes. This lack of clarity poses barriers to utilizing deep learning in improving skin cancer detection and treatment. In this article, we present a method aimed at enhancing the interpretability of deep learning models for skin cancer classification in Bangladesh. Our technique employs a combination of saliency maps and attention maps to visualize critical features influencing the model's diagnoses.
https://arxiv.org/abs/2501.18161
Despite significant advancements in environment perception capabilities for autonomous driving and intelligent robotics, cameras and LiDARs remain notoriously unreliable in low-light conditions and adverse weather, which limits their effectiveness. Radar serves as a reliable and low-cost sensor that can effectively complement these limitations. However, radar-based object detection has been underexplored due to the inherent weaknesses of radar data, such as low resolution, high noise, and lack of visual information. In this paper, we present TransRAD, a novel 3D radar object detection model designed to address these challenges by leveraging the Retentive Vision Transformer (RMT) to more effectively learn features from information-dense radar Range-Azimuth-Doppler (RAD) data. Our approach leverages the Retentive Manhattan Self-Attention (MaSA) mechanism provided by RMT to incorporate explicit spatial priors, thereby enabling more accurate alignment with the spatial saliency characteristics of radar targets in RAD data and achieving precise 3D radar detection across Range-Azimuth-Doppler dimensions. Furthermore, we propose Location-Aware NMS to effectively mitigate the common issue of duplicate bounding boxes in deep radar object detection. The experimental results demonstrate that TransRAD outperforms state-of-the-art methods in both 2D and 3D radar detection tasks, achieving higher accuracy, faster inference speed, and reduced computational complexity. Code is available at this https URL
https://arxiv.org/abs/2501.17977
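The explicit spatial prior of Manhattan Self-Attention can be sketched as a decay mask over positions of the Range-Azimuth grid. The decay value and the way the mask enters the attention product follow the RMT formulation only loosely and should be read as assumptions.

import numpy as np

def manhattan_decay_mask(h, w, gamma=0.9):
    """D[i, j] = gamma ** (Manhattan distance between grid cells i and j),
    giving nearby cells exponentially more influence."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1)          # (h*w, 2)
    dist = np.abs(pos[:, None, :] - pos[None, :, :]).sum(-1)  # (h*w, h*w)
    return gamma ** dist

def masa(Q, K, V, h, w, gamma=0.9):
    """Softmax attention modulated by the Manhattan decay mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    attn = attn * manhattan_decay_mask(h, w, gamma)
    attn /= attn.sum(-1, keepdims=True)        # renormalize (an assumption)
    return attn @ V

tokens = np.random.rand(16 * 16, 32)   # a 16x16 Range-Azimuth grid with 32-dim features
print(masa(tokens, tokens, tokens, 16, 16).shape)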
CutMix is a data augmentation strategy that cuts and pastes image patches to mixup training data. Existing methods pick either random or salient areas, which are often inconsistent with labels, thus misguiding the training model. To the best of our knowledge, we are the first to integrate human gaze to guide CutMix. Since human attention is driven by both high-level recognition and low-level clues, we propose a controllable Top-down Attention Guided Module to obtain a general artificial attention which balances top-down and bottom-up attention. The proposed TdAttenMix then picks the patches and adjusts the label mixing ratio to focus on regions relevant to the current label. Experimental results demonstrate that our TdAttenMix outperforms existing state-of-the-art mixup methods across eight different benchmarks. Additionally, we introduce a new metric based on human gaze and use this metric to investigate the issue of image-label inconsistency. Project page: \url{this https URL}
https://arxiv.org/abs/2501.15409
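A sketch of gaze-guided CutMix under simple assumptions: cut the box around the gaze-attention peak of the pasted image and set the label mixing ratio from the attention mass actually transferred. The box size and the re-weighting rule are illustrative, not TdAttenMix's exact module.

import numpy as np

def gaze_guided_cutmix(img_a, img_b, attn_b, box=64):
    """Paste the most gaze-attended box of img_b onto img_a and derive the
    label mixing ratio from the attention mass inside the pasted box."""
    H, W, _ = img_a.shape
    cy, cx = np.unravel_index(attn_b.argmax(), attn_b.shape)   # attention peak
    y0 = int(np.clip(cy - box // 2, 0, H - box))
    x0 = int(np.clip(cx - box // 2, 0, W - box))
    mixed = img_a.copy()
    mixed[y0:y0 + box, x0:x0 + box] = img_b[y0:y0 + box, x0:x0 + box]
    # Label weight for img_b's class = fraction of attention mass transferred.
    lam_b = attn_b[y0:y0 + box, x0:x0 + box].sum() / attn_b.sum()
    return mixed, 1.0 - lam_b, lam_b

a, b = np.random.rand(224, 224, 3), np.random.rand(224, 224, 3)
attn = np.random.rand(224, 224)       # stand-in for the gaze/attention map of b
mixed, w_a, w_b = gaze_guided_cutmix(a, b, attn)
print(mixed.shape, round(w_a, 3), round(w_b, 3))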
Simulation offers a scalable and efficient alternative to real-world data collection for learning visuomotor robotic policies. However, the simulation-to-reality, or "Sim2Real" distribution shift -- introduced by employing simulation-trained policies in real-world environments -- frequently prevents successful policy transfer. This study explores the potential of using large-scale pre-training of vision encoders to address the Sim2Real gap. We examine a diverse collection of encoders, evaluating their ability to (1) extract features necessary for robot control while (2) remaining invariant to task-irrelevant environmental variations. We quantitatively measure the encoder's feature extraction capabilities through linear probing and its domain invariance by computing distances between simulation and real-world embedding centroids. Additional qualitative insights are provided through t-SNE plots and GradCAM saliency maps. Findings suggest that encoders pre-trained on manipulation-specific datasets generally outperform those trained on generic datasets in bridging the Sim2Real gap. this https URL
https://arxiv.org/abs/2501.16389
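The domain-invariance measure described above reduces to a distance between embedding centroids; a minimal version, with an assumed scale normalization, looks like this.

import numpy as np

def domain_gap(sim_embeddings, real_embeddings):
    """L2 distance between the centroids of simulation and real-world embeddings
    produced by a frozen vision encoder; the normalization by the mean embedding
    norm is an assumption to make encoders with different scales comparable."""
    sim_c = sim_embeddings.mean(axis=0)
    real_c = real_embeddings.mean(axis=0)
    scale = np.linalg.norm(np.concatenate([sim_embeddings, real_embeddings]), axis=1).mean()
    return np.linalg.norm(sim_c - real_c) / (scale + 1e-8)

sim = np.random.rand(500, 768)      # encoder features of simulated frames
real = np.random.rand(500, 768)     # encoder features of real frames
print(domain_gap(sim, real))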
Vision-language models (VLMs) show remarkable performance in multimodal tasks. However, excessively long multimodal inputs lead to oversized Key-Value (KV) caches, resulting in significant memory consumption and I/O bottlenecks. Previous KV quantization methods for Large Language Models (LLMs) may alleviate these issues but overlook the attention saliency differences of multimodal tokens, resulting in suboptimal performance. In this paper, we investigate the attention-aware token saliency patterns in VLM and propose AKVQ-VL. AKVQ-VL leverages the proposed Text-Salient Attention (TSA) and Pivot-Token-Salient Attention (PSA) patterns to adaptively allocate bit budgets. Moreover, achieving extremely low-bit quantization requires effectively addressing outliers in KV tensors. AKVQ-VL utilizes the Walsh-Hadamard transform (WHT) to construct outlier-free KV caches, thereby reducing quantization difficulty. Evaluations of 2-bit quantization on 12 long-context and multimodal tasks demonstrate that AKVQ-VL maintains or even improves accuracy, outperforming LLM-oriented methods. AKVQ-VL can reduce peak memory usage by 2.13x, support up to 3.25x larger batch sizes and 2.46x throughput.
https://arxiv.org/abs/2501.15021
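The outlier-suppression step can be sketched with a fast Walsh-Hadamard transform applied along the channel dimension before uniform low-bit quantization; the tensor shapes, per-row scaling, and 2-bit codebook below are assumptions, not AKVQ-VL's exact scheme.

import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a
    power of two); orthonormal scaling makes the transform its own inverse."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        a = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a[..., 0, :], a[..., 1, :] = a[..., 0, :] + a[..., 1, :], a[..., 0, :] - a[..., 1, :]
        x = a.reshape(*x.shape[:-1], n)
        h *= 2
    return x / np.sqrt(n)

def quantize(x, bits=2):
    """Per-row uniform quantization to 2**bits levels, returning dequantized values."""
    lo, hi = x.min(-1, keepdims=True), x.max(-1, keepdims=True)
    step = (hi - lo) / (2 ** bits - 1) + 1e-12
    return np.round((x - lo) / step) * step + lo

kv = np.random.randn(128, 64)
kv[:, 3] *= 50.0                           # inject a channel-wise outlier
plain_err = np.abs(quantize(kv) - kv).mean()
rot = fwht(kv)                             # rotate, quantize, rotate back
wht_err = np.abs(fwht(quantize(rot)) - kv).mean()
print(plain_err, wht_err)                  # the rotated tensor typically quantizes better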
Skin cancer is one of the most prevalent and potentially life-threatening diseases worldwide, necessitating early and accurate diagnosis to improve patient outcomes. Conventional diagnostic methods, reliant on clinical expertise and histopathological analysis, are often time-intensive, subjective, and prone to variability. To address these limitations, we propose a novel hybrid deep learning framework that integrates convolutional neural networks (CNNs) with Radial Basis Function (RBF) Networks to achieve high classification accuracy and enhanced interpretability. The motivation for incorporating RBF Networks lies in their intrinsic interpretability and localized response to input features, which make them well-suited for tasks requiring transparency and fine-grained decision-making. Unlike traditional deep learning models that rely on global feature representations, RBF Networks allow for mapping segments of images to chosen prototypes, exploiting salient features within a single image. This enables clinicians to trace predictions to specific, interpretable patterns. The framework incorporates segmentation-based feature extraction, active learning for prototype selection, and K-Medoids clustering to focus on these salient features. Evaluations on the ISIC 2016 and ISIC 2017 datasets demonstrate the model's effectiveness, achieving classification accuracies of 83.02% and 72.15% using ResNet50, respectively, and outperforming VGG16-based configurations. By generating interpretable explanations for predictions, the framework aligns with clinical workflows, bridging the gap between predictive performance and trustworthiness. This study highlights the potential of hybrid models to deliver actionable insights, advancing the development of reliable AI-assisted diagnostic tools for high-stakes medical applications.
https://arxiv.org/abs/2501.14885
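The interpretable read-out can be sketched as an RBF layer that scores each image segment's features against stored prototypes; prototype selection by active learning and K-Medoids is outside this sketch, and the kernel width is an assumption.

import numpy as np

class RBFLayer:
    """Maps feature vectors to similarities with stored prototypes via a
    Gaussian kernel, so each output column is traceable to one prototype."""
    def __init__(self, prototypes, gamma=1.0):
        self.prototypes = prototypes        # (P, D), e.g. K-Medoids medoids of segment features
        self.gamma = gamma

    def __call__(self, x):
        # x: (N, D) -> (N, P) prototype activations.
        d2 = ((x[:, None, :] - self.prototypes[None, :, :]) ** 2).sum(-1)
        return np.exp(-self.gamma * d2)

feats = np.random.rand(4, 128)               # CNN features of 4 lesion image segments
protos = np.random.rand(10, 128)             # 10 stored prototypes
print(RBFLayer(protos)(feats).shape)         # (4, 10)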
Ensuring contextual faithfulness in retrieval-augmented large language models (LLMs) is crucial for building trustworthy information-seeking systems, particularly in long-form question-answering (LFQA) scenarios. In this work, we identify a salient correlation between LFQA faithfulness and retrieval heads, a set of attention heads responsible for retrieving contextual information. Leveraging this insight, we propose RHIO, a framework designed to teach LLMs to explicitly discriminate between faithful and unfaithful generations. RHIO first augments unfaithful samples that simulate realistic model-intrinsic errors by selectively masking retrieval heads. Then, these samples are incorporated into joint training, enabling the model to distinguish unfaithful outputs from faithful ones conditioned on control tokens. Furthermore, these control tokens are leveraged to self-induce contrastive outputs, amplifying their difference through contrastive decoding. Additionally, to facilitate the evaluation of contextual faithfulness, we also introduce GroundBench, a comprehensive benchmark compiled from five existing LFQA datasets. Extensive experimental results on GroundBench demonstrate that RHIO significantly improves faithfulness, even outperforming GPT-4o.
https://arxiv.org/abs/2501.13573
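One way to picture the unfaithful-sample construction is to zero out selected "retrieval heads" in a multi-head attention output before the output projection; which heads are masked, and where in the model the mask is applied, are assumptions for illustration rather than RHIO's implementation.

import numpy as np

def mask_retrieval_heads(head_outputs, retrieval_heads, keep_prob=0.0):
    """head_outputs: (num_heads, seq_len, head_dim) per-head attention outputs.
    Zeroing the listed heads removes much of the model's ability to copy from
    context, so the resulting generations tend to be unfaithful."""
    out = head_outputs.copy()
    for h in retrieval_heads:
        if np.random.rand() >= keep_prob:
            out[h] = 0.0
    return out

heads = np.random.rand(12, 32, 64)
masked = mask_retrieval_heads(heads, retrieval_heads=[2, 7, 9])
print(np.abs(masked[2]).sum(), np.abs(masked[0]).sum())   # masked head vs untouched head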
Existing structured pruning typically involves multi-stage training procedures that often demand heavy computation. Pruning at initialization, which aims to address this limitation, reduces training costs but struggles with performance. To address these challenges, we propose an efficient framework for one-cycle structured pruning without compromising model performance. In this approach, we integrate pre-training, pruning, and fine-tuning into a single training cycle, referred to as the 'one-cycle approach'. The core idea is to search for the optimal sub-network during the early stages of network training, guided by norm-based group saliency criteria and structured sparsity regularization. We introduce a novel pruning indicator that determines the stable pruning epoch by assessing the similarity between evolving pruning sub-networks across consecutive training epochs. In addition, group sparsity regularization accelerates the pruning process and thereby speeds up the entire pipeline. Extensive experiments on datasets, including CIFAR-10/100 and ImageNet, using VGGNet, ResNet, MobileNet, and ViT architectures, demonstrate that our method achieves state-of-the-art accuracy while being one of the most efficient pruning frameworks in terms of training time. The source code will be made publicly available.
https://arxiv.org/abs/2501.13439
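The stable-pruning-epoch indicator can be sketched as the overlap between sub-networks selected at consecutive epochs by a norm-based group saliency criterion; the keep ratio, threshold, and Jaccard overlap below are assumptions for illustration.

import numpy as np

def group_mask(group_weights, keep_ratio=0.5):
    """Keep the groups (e.g. channels) with the largest L2 norms."""
    norms = np.array([np.linalg.norm(w) for w in group_weights])
    k = int(keep_ratio * len(norms))
    keep = np.zeros(len(norms), dtype=bool)
    keep[np.argsort(norms)[-k:]] = True
    return keep

def stable_to_prune(prev_mask, curr_mask, tau=0.95):
    """Prune once the selected sub-network stops changing between epochs,
    measured here by the Jaccard overlap of the keep-masks."""
    inter = np.logical_and(prev_mask, curr_mask).sum()
    union = np.logical_or(prev_mask, curr_mask).sum()
    return inter / union >= tau

w_epoch3 = [np.random.randn(64) * s for s in np.random.rand(128)]   # per-channel weights
w_epoch4 = [w + 0.01 * np.random.randn(64) for w in w_epoch3]       # slightly evolved weights
m3, m4 = group_mask(w_epoch3), group_mask(w_epoch4)
print(stable_to_prune(m3, m4))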
The standard ``serial'' (aka ``singleton'') model of belief contraction models the manner in which an agent's corpus of beliefs responds to the removal of a single item of information. One salient extension of this model introduces the idea of ``parallel'' (aka ``package'' or ``multiple'') change, in which an entire set of items of information are simultaneously removed. Existing research on the latter has largely focussed on single-step parallel contraction: understanding the behaviour of beliefs after a single parallel contraction. It has also focussed on generalisations to the parallel case of serial contraction operations whose characteristic properties are extremely weak. Here we consider how to extend serial contraction operations that obey stronger properties. Potentially more importantly, we also consider the iterated case: the behaviour of beliefs after a sequence of parallel contractions. We propose a general method for extending serial iterated belief change operators to handle parallel change based on an n-ary generalisation of Booth & Chandler's TeamQueue binary order aggregators.
https://arxiv.org/abs/2501.13295
Multivariate time series anomaly detection is essential for failure management in web application operations, as it directly influences the effectiveness and timeliness of implementing remedial or preventive measures. This task is often framed as a semi-supervised learning problem, where only normal data are available for model training, primarily due to the labor-intensive nature of data labeling and the scarcity of anomalous data. Existing semi-supervised methods often detect anomalies by capturing intra-variate temporal dependencies and/or inter-variate relationships to learn normal patterns, flagging timestamps that deviate from these patterns as anomalies. However, these approaches often fail to capture salient intra-variate temporal and inter-variate dependencies in time series due to their focus on excessively fine granularity, leading to suboptimal performance. In this study, we introduce MtsCID, a novel semi-supervised multivariate time series anomaly detection method. MtsCID employs a dual network architecture: one network operates on the attention maps of multi-scale intra-variate patches for coarse-grained temporal dependency learning, while the other works on variates to capture coarse-grained inter-variate relationships through convolution and interaction with sinusoidal prototypes. This design enhances the ability to capture the patterns from both intra-variate temporal dependencies and inter-variate relationships, resulting in improved performance. Extensive experiments across seven widely used datasets demonstrate that MtsCID achieves performance comparable or superior to state-of-the-art benchmark methods.
https://arxiv.org/abs/2501.16364
Video encoders optimize compression for human perception by minimizing reconstruction error under bit-rate constraints. In many modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks like object recognition or segmentation, rather than being watched by humans. It is therefore useful to optimize the encoder for a downstream task instead of for perceptual image quality. However, a major challenge is how to combine such downstream optimization with existing standard video encoders, which are highly efficient and popular. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task. This granular control allows us to prioritize encoding for task-relevant regions within each frame. We formulate this optimization problem as a Reinforcement Learning (RL) task, where the agent learns to balance long-term implications of choosing QPs on both task performance and bit-rate constraints. Notably, our policy does not require the downstream task as an input during inference, making it suitable for streaming applications and edge devices such as vehicles. We demonstrate significant improvements in two tasks: car detection and ROI (saliency) encoding. Our approach improves task performance for a given bit rate compared to traditional task-agnostic encoding methods, paving the way for more efficient task-aware video compression.
https://arxiv.org/abs/2501.12216
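A fixed-rule sketch of macro-block-level QP control: average a task saliency/ROI map per 16x16 block and assign finer quantization to task-relevant blocks. The paper instead learns this allocation with RL under bit-rate constraints; the block size, QP range, and linear mapping here are assumptions.

import numpy as np

def saliency_to_qp(saliency, mb=16, qp_min=22, qp_max=40):
    """Average saliency per macro-block, then assign lower QP
    (finer quantization, more bits) to more task-relevant blocks."""
    H, W = saliency.shape
    blocks = saliency[:H - H % mb, :W - W % mb].reshape(H // mb, mb, W // mb, mb).mean(axis=(1, 3))
    blocks = (blocks - blocks.min()) / (blocks.max() - blocks.min() + 1e-8)
    return np.round(qp_max - blocks * (qp_max - qp_min)).astype(int)

sal = np.random.rand(720, 1280)       # stand-in for a car-detection / ROI saliency map
qp_map = saliency_to_qp(sal)
print(qp_map.shape, qp_map.min(), qp_map.max())   # (45, 80), finest QP on salient blocks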