Many explainable AI (XAI) techniques strive for interpretability by providing concise salient information, such as sparse linear factors. However, users see either inaccurate global explanations or highly varying local explanations. We propose to provide more detailed explanations by leveraging the human cognitive capacity to accumulate knowledge incrementally. Focusing on linear factor explanations (factors $\times$ values = outcome), we introduce Incremental XAI to automatically partition explanations for general and atypical instances, providing Base + Incremental factors to help users read and remember more faithful explanations. Memorability is improved by reusing base factors and reducing the number of factors shown in atypical cases. In modeling, formative, and summative user studies, we evaluated the faithfulness, memorability, and understandability of Incremental XAI against baseline explanation methods. This work contributes towards more usable explanations that users can better internalize to facilitate intuitive engagement with AI.
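To make the Base + Incremental idea concrete, here is a minimal sketch in Python, assuming a sparse linear (Lasso) explainer and a hypothetical rule for flagging atypical instances; the paper's actual partitioning procedure may differ.

```python
# A minimal sketch of Base + Incremental linear factor explanations
# (factors x values = outcome). Data and the atypicality rule are
# hypothetical placeholders.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=500)
atypical = X[:, 0] > 1.5  # hypothetical criterion for atypical instances

# Base factors: one sparse linear explanation fit on all instances.
base = Lasso(alpha=0.05).fit(X, y)

# Incremental factors: a sparse correction fit on the residuals of the
# atypical instances, reusing (not re-estimating) the base factors.
residual = y[atypical] - base.predict(X[atypical])
incr = Lasso(alpha=0.05).fit(X[atypical], residual)

print("base factors:", np.round(base.coef_, 2))
print("incremental factors (atypical):", np.round(incr.coef_, 2))
```

Reusing the base factors is what the abstract credits for memorability: users re-read only the small incremental correction for atypical cases.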
https://arxiv.org/abs/2404.06733
As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports not only are long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Command. We find that GPT-3.5 and Command fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude has the ability to recognize important information. We also conduct a comprehensive investigation on the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4.
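As a rough illustration of how the position bias described above can be probed, the sketch below maps each summary sentence to its best-overlapping source chunk via token overlap; the matching heuristic and names are assumptions for illustration, not the paper's computational framework.

```python
# Locate each summary sentence in the source by crude token overlap and
# report the relative positions of the matched chunks (0 = start, 1 = end).
def position_profile(source_chunks: list[str], summary_sents: list[str]) -> list[float]:
    positions = []
    for sent in summary_sents:
        s_tokens = set(sent.lower().split())
        overlaps = [len(s_tokens & set(c.lower().split())) for c in source_chunks]
        best = max(range(len(source_chunks)), key=lambda i: overlaps[i])
        positions.append(best / max(len(source_chunks) - 1, 1))
    return positions

# A profile skewed toward 0 indicates a lead/position bias; re-running after
# shuffling source_chunks tests whether the model tracks content, not position.
```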
https://arxiv.org/abs/2404.06162
While excellent in transfer learning, Vision-Language models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), the first gradient-free pruning framework for TA-VLP, where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent sophisticated, combinatorial competitors in the vast majority of the cases, paving the way towards addressing TA-VLP. The code is publicly available at this https URL.
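The sketch below conveys the flavor of scoring a parameter by its magnitude together with the saliency of the neurons it connects, on a single linear layer, with the simplifying assumption that a neuron's saliency is the aggregate magnitude of its incident weights; MULTIFLOW's actual multimodal formulation is more involved.

```python
# Magnitude x information-flow scoring for one linear layer (illustrative).
import torch

def multiflow_style_scores(weight: torch.Tensor) -> torch.Tensor:
    """weight: (out_features, in_features) of a single linear layer."""
    out_saliency = weight.abs().sum(dim=1)  # saliency of each output neuron
    in_saliency = weight.abs().sum(dim=0)   # saliency of each input neuron
    # Each parameter is scored by its own magnitude and the saliency of the
    # two neurons it connects (the flow through that edge).
    return weight.abs() * out_saliency[:, None] * in_saliency[None, :]

w = torch.randn(8, 16)
scores = multiflow_style_scores(w)
k = int(0.5 * scores.numel())  # prune roughly 50% of parameters
mask = scores > scores.flatten().kthvalue(k).values
```

Note the scoring is gradient-free: it reads only the pretrained weights, matching the paper's requirement that pruning happen once, before any downstream task is known.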
https://arxiv.org/abs/2404.05621
In Generalized Category Discovery (GCD), we cluster unlabeled samples of known and novel classes, leveraging a training dataset of known classes. A salient challenge arises due to domain shifts between these datasets. To address this, we present a novel setting: Across Domain Generalized Category Discovery (AD-GCD) and bring forth CDAD-NET (Class Discoverer Across Domains) as a remedy. CDAD-NET is architected to synchronize potential known class samples across both the labeled (source) and unlabeled (target) datasets, while emphasizing the distinct categorization of the target data. To facilitate this, we propose an entropy-driven adversarial learning strategy that accounts for the distance distributions of target samples relative to source-domain class prototypes. In parallel, the discriminative nature of the shared space is upheld through a fusion of three metric learning objectives. In the source domain, our focus is on refining the proximity between samples and their affiliated class prototypes, while in the target domain, we integrate a neighborhood-centric contrastive learning mechanism, enriched with an adept neighbor-mining approach. To further accentuate the nuanced feature interrelation among semantically aligned images, we champion the concept of conditional image inpainting, underscoring the premise that semantically analogous images prove more efficacious to the task than their disjointed counterparts. Experimentally, CDAD-NET eclipses existing literature with a performance increment of 8-15% on the three AD-GCD benchmarks we present.
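A hedged sketch of the entropy quantity over target-to-prototype distances appears below; the softmax-over-negative-distances form is an assumption for illustration, and how CDAD-NET plays this term adversarially between network components is the paper's design.

```python
# Entropy of a target sample's assignment distribution over source-domain
# class prototypes: low when the sample sits near one known prototype,
# high when it is ambiguous (a plausible signal for novel classes).
import torch
import torch.nn.functional as F

def prototype_entropy(target_feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """target_feats: (N, D); prototypes: (C, D)."""
    dists = torch.cdist(target_feats, prototypes)  # (N, C)
    probs = F.softmax(-dists, dim=1)               # nearer prototype -> more mass
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    return entropy.mean()
```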
https://arxiv.org/abs/2404.05366
Recent progress has shown great potential of visual prompt tuning (VPT) when adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions independently optimize prompts at each layer, thereby neglecting the task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can harm the sharing of task-relevant information. In this paper, we propose a novel VPT approach, \textbf{iVPT}. It innovatively incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, iVPT introduces an attentive reinforcement (AR) mechanism that automatically identifies salient image tokens, which are further enhanced by prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantage of the proposed iVPT over state-of-the-art counterparts.
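As a sketch of what a cross-layer dynamic connection for prompt tokens might look like (the gating form and names are illustrative assumptions, not iVPT's exact design):

```python
# Prompts fed to layer l mix the layer's own learned prompts with those
# propagated from layer l-1, through a per-token learned gate.
import torch
import torch.nn as nn

class CrossLayerPromptConnection(nn.Module):
    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        self.own_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.gate = nn.Linear(dim, 1)  # dynamic mixing weight per prompt token

    def forward(self, prev_prompts: torch.Tensor) -> torch.Tensor:
        """prev_prompts: (num_prompts, dim) prompt tokens from the previous layer."""
        alpha = torch.sigmoid(self.gate(prev_prompts))  # (num_prompts, 1)
        return alpha * prev_prompts + (1 - alpha) * self.own_prompts
```

The dynamic gate is what lets task-relevant information pass between layers while allowing each layer to fall back on its own prompts when the propagated ones are noisy.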
https://arxiv.org/abs/2404.05207
Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks, this paper presents a novel hybrid visual odometry (VO) framework that leverages pose-only supervision, offering a balanced solution between robustness and the need for extensive labeling. We propose two cost-effective and innovative designs: a self-supervised homographic pre-training for enhancing optical flow learning from pose-only labels and a random patch-based salient point detection strategy for more accurate optical flow patch extraction. These designs eliminate the need for dense optical flow labels for training and significantly improve the generalization capability of the system in diverse and challenging environments. Our pose-only supervised method achieves competitive performance on standard datasets and greater robustness and generalization ability in extreme and unseen scenarios, even compared to dense optical flow-supervised state-of-the-art methods.
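The homographic pre-training idea can be sketched as follows: warping an image with a random homography induces a dense flow field that can be written down analytically, so flow supervision comes for free. This is a schematic reading of the design, not the authors' implementation.

```python
# Ground-truth optical flow induced by a 3x3 homography h on an HxW image.
import numpy as np

def homography_flow(h: np.ndarray, height: int, width: int) -> np.ndarray:
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    warped = pts @ h.T
    warped = warped[:, :2] / warped[:, 2:3]  # back from homogeneous coords
    return warped.reshape(height, width, 2) - np.stack([xs, ys], axis=-1)

# Pairing an image with cv2.warpPerspective(image, h, (width, height)) and
# this flow yields a self-supervised training triple with no manual labels.
```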
https://arxiv.org/abs/2404.04677
Gradient-based saliency maps have been widely used to explain the decisions of deep neural network classifiers. However, standard gradient-based interpretation maps, including the simple gradient and integrated gradient algorithms, often lack desired structures such as sparsity and connectedness in their application to real-world computer vision models. A frequently used approach to inducing sparsity structures into gradient-based saliency maps is to alter the simple gradient scheme using sparsification or norm-based regularization. A drawback of such post-processing methods is their frequently observed, significant loss of fidelity to the original simple gradient map. In this work, we propose to apply adversarial training as an in-processing scheme to train neural networks with structured simple gradient maps. We show a duality relation between the regularized norms of the adversarial perturbations and gradient-based maps, based on which we design adversarial training loss functions promoting sparsity and group-sparsity properties in simple gradient maps. We present several numerical results to show the influence of our proposed norm-based adversarial training methods on the standard gradient-based maps of standard neural network architectures on benchmark image datasets.
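The norm duality at the heart of the approach is visible in a single FGSM-style step: maximizing the loss over an $\ell_\infty$ ball of radius $\epsilon$ gives a first-order objective of $\epsilon \|\nabla_x \mathcal{L}\|_1$, so training against such perturbations implicitly penalizes the $\ell_1$ norm of the simple gradient map. The sketch below is illustrative; the paper develops several such losses for sparsity and group sparsity.

```python
# One l-infinity-bounded adversarial step; its dual norm is l1, which is
# what promotes sparse input gradients. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def linf_adversarial_example(model, x, y, eps=4 / 255):
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # sign(grad) maximizes <grad, delta> over the l-inf ball; the attained
    # value is eps * ||grad||_1 -- the duality used by the paper.
    return (x + eps * grad.sign()).detach()
```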
https://arxiv.org/abs/2404.04647
There currently exist two extreme viewpoints for neural network feature learning -- (i) neural networks simply implement a kernel method (à la NTK) and hence no features are learned; (ii) neural networks can represent (and hence learn) intricate hierarchical features suitable for the data. In this paper, we argue, based on a novel viewpoint, that neither interpretation is likely to be correct. Neural networks can be viewed as a mixture of experts, where each expert corresponds to a path through a sequence of hidden units, of length equal to the number of layers. We use this alternate interpretation to motivate a model, called the Deep Linearly Gated Network (DLGN), which sits midway between deep linear networks and ReLU networks. Unlike deep linear networks, the DLGN is capable of learning non-linear features (which are then linearly combined), and unlike ReLU networks these features are ultimately simple -- each feature is effectively an indicator function for a region compactly described as an intersection of as many half-spaces in the input space as there are layers. This viewpoint allows for a comprehensive global visualization of features, unlike the local visualizations for neurons based on saliency/activation/gradient maps. We show that feature learning does occur in DLGNs, and that its mechanism is the learning of half-spaces in the input space that contain smooth regions of the target function. Due to the structure of DLGNs, the neurons in later layers are fundamentally the same as those in earlier layers -- they all represent a half-space -- however, the dynamics of gradient descent impart a distinct clustering to the later-layer neurons. We hypothesize that ReLU networks exhibit similar feature-learning behaviour.
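A two-layer toy rendering of this viewpoint is sketched below: each feature fires only on an intersection of half-spaces (one per layer, each linear in the input), and the active features are combined linearly. Hard indicator gates are shown for clarity; training a DLGN-style model would use a soft gate.

```python
# Toy DLGN-flavored model: features are indicators of intersections of
# half-spaces, combined linearly. Illustrative, not the paper's exact model.
import torch
import torch.nn as nn

class ToyDLGN(nn.Module):
    def __init__(self, dim: int, width: int):
        super().__init__()
        self.gate1 = nn.Linear(dim, width)  # half-spaces for "layer 1"
        self.gate2 = nn.Linear(dim, width)  # half-spaces for "layer 2"
        self.value = nn.Linear(width, 1)    # linear combination of features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A feature is active only inside the intersection of its half-spaces,
        # which is what admits the global visualization described above.
        g = (self.gate1(x) > 0).float() * (self.gate2(x) > 0).float()
        return self.value(g)
```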
https://arxiv.org/abs/2404.04312
Abstractive summarization for long-form narrative texts such as movie scripts is challenging due to the computational and memory constraints of current language models. A movie script typically comprises a large number of scenes; however, only a fraction of these scenes are salient, i.e., important for understanding the overall narrative. The salience of a scene can be operationalized by considering it as salient if it is mentioned in the summary. Automatically identifying salient scenes is difficult due to the lack of suitable datasets. In this work, we introduce a scene saliency dataset that consists of human-annotated salient scenes for 100 movies. We propose a two-stage abstractive summarization approach which first identifies the salient scenes in script and then generates a summary using only those scenes. Using QA-based evaluation, we show that our model outperforms previous state-of-the-art summarization methods and reflects the information content of a movie more accurately than a model that takes the whole movie script as input.
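The two-stage approach reduces to a pipeline of the following shape, with the scene-saliency classifier and the summarizer as placeholders for the trained components:

```python
# Stage 1: keep scenes predicted salient; stage 2: summarize only those.
def summarize_script(scenes: list[str], saliency_model, summarizer,
                     threshold: float = 0.5) -> str:
    salient = [s for s in scenes if saliency_model(s) >= threshold]
    return summarizer("\n\n".join(salient))
```

Filtering first keeps the input within the model's context budget, which is exactly the computational constraint the abstract opens with.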
https://arxiv.org/abs/2404.03561
Saliency post-hoc explainability methods are important tools for understanding increasingly complex NLP models. While these methods can reflect the model's reasoning, they may not align with human intuition, making the explanations implausible. In this work, we present a methodology for incorporating rationales, which are text annotations explaining human decisions, into text classification models. This incorporation enhances the plausibility of post-hoc explanations while preserving their faithfulness. Our approach is agnostic to model architectures and explainability methods. We introduce the rationales during model training by augmenting the standard cross-entropy loss with a novel loss function inspired by contrastive learning. By leveraging a multi-objective optimization algorithm, we explore the trade-off between the two loss functions and generate a Pareto-optimal frontier of models that balance performance and plausibility. Through extensive experiments involving diverse models, datasets, and explainability methods, we demonstrate that our approach significantly enhances the quality of model explanations while causing only modest (sometimes negligible) degradation in the original model's performance.
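One way to picture the augmented objective is sketched below: a hinge-style term, loosely inspired by contrastive learning, that rewards higher saliency on rationale tokens than elsewhere, traded off against cross-entropy by a single weight. The paper's actual loss and its multi-objective Pareto search differ from this single-weight simplification.

```python
# Cross-entropy plus a rationale term that makes annotated tokens stand out.
import torch
import torch.nn.functional as F

def rationale_loss(token_saliency: torch.Tensor, rationale_mask: torch.Tensor,
                   margin: float = 0.1) -> torch.Tensor:
    """token_saliency: (T,) attributions; rationale_mask: (T,) 1 on rationale tokens."""
    pos = token_saliency[rationale_mask.bool()].mean()
    neg = token_saliency[~rationale_mask.bool()].mean()
    return F.relu(margin - (pos - neg))  # hinge: rationales should score higher

def total_loss(ce_loss: torch.Tensor, token_saliency, rationale_mask, lam=0.5):
    return (1 - lam) * ce_loss + lam * rationale_loss(token_saliency, rationale_mask)
```

Sweeping lam and keeping the non-dominated models is one simple way to trace the performance-plausibility frontier the abstract mentions.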
https://arxiv.org/abs/2404.03098
Recent advancements in video saliency prediction (VSP) have shown promising performance compared to the human visual system, whose emulation is the primary goal of VSP. However, current state-of-the-art models employ spatio-temporal transformers trained on limited amounts of data, hindering their generalizability and adaptation to downstream tasks. The benefits of vision foundation models present a potential solution to improve the VSP process. However, adapting image foundation models to the video domain presents significant challenges in modeling scene dynamics and capturing temporal information. To address these challenges, and as the first initiative to design a VSP model based on video foundation models, we introduce SalFoM, a novel encoder-decoder video transformer architecture. Our model employs UnMasked Teacher (UMT) as a feature extractor and presents a heterogeneous decoder that features a locality-aware spatio-temporal transformer and integrates local and global spatio-temporal information from various perspectives to produce the final saliency map. Our qualitative and quantitative experiments on the challenging VSP benchmark datasets of DHF1K, Hollywood-2 and UCF-Sports demonstrate the superiority of our proposed model in comparison with the state-of-the-art methods.
https://arxiv.org/abs/2404.03097
In this work, we investigate methods to reduce the noise in deep saliency maps coming from convolutional downsampling, with the purpose of explaining how a deep learning model detects tumors in scanned histological tissue samples. Those methods make the investigated models more interpretable for gradient-based saliency maps, computed in hidden layers. We test our approach on different models trained for image classification on ImageNet1K, and models trained for tumor detection on Camelyon16 and in-house real-world digital pathology scans of stained tissue samples. Our results show that the checkerboard noise in the gradient gets reduced, resulting in smoother and therefore easier to interpret saliency maps.
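For reference, a hidden-layer gradient saliency map of the kind being de-noised can be captured with a forward hook, as in this PyTorch sketch (names are generic, not the paper's code):

```python
# Gradient of a class score w.r.t. an intermediate activation, summed over
# channels to give a spatial saliency map at that hidden layer.
import torch

def hidden_layer_saliency(model, layer, x, class_idx):
    acts = {}

    def hook(_module, _inputs, output):
        output.retain_grad()  # keep the gradient on this non-leaf tensor
        acts["a"] = output

    handle = layer.register_forward_hook(hook)
    score = model(x)[0, class_idx]
    handle.remove()
    score.backward()
    return acts["a"].grad[0].abs().sum(dim=0)  # (H', W') spatial map
```

It is in maps like this, computed downstream of convolutional downsampling, that the checkerboard pattern appears and that the investigated methods smooth out.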
https://arxiv.org/abs/2404.02282
In deep reinforcement learning (RL) research, there has been a concerted effort to design more efficient and productive exploration methods for solving sparse-reward problems. These exploration methods often share common principles (e.g., improving diversity) and implementation details (e.g., intrinsic reward). Prior work found that non-stationary Markov decision processes (MDPs) require exploration to efficiently adapt to changes in the environment with online transfer learning. However, the relationship between specific exploration characteristics and effective transfer learning in deep RL has not been characterized. In this work, we seek to understand the relationships between salient exploration characteristics and improved performance and efficiency in transfer learning. We test eleven popular exploration algorithms on a variety of transfer types -- or ``novelties'' -- to identify the characteristics that positively affect online transfer learning. Our analysis shows that some characteristics correlate with improved performance and efficiency across a wide range of transfer tasks, while others only improve transfer performance with respect to specific environment changes. From our analysis, we make recommendations about which exploration algorithm characteristics are best suited to specific transfer situations.
https://arxiv.org/abs/2404.02235
This study proposes a novel transfer learning framework for effective ship classification using high-resolution optical remote sensing satellite imagery. The framework is based on the deep convolutional neural network model ResNet50 and incorporates the Convolutional Block Attention Module (CBAM) to enhance performance. CBAM enables the model to attend to salient features in the images, allowing it to better discriminate between subtle differences between ships and backgrounds. Furthermore, this study adopts a transfer learning approach tailored for accurately classifying diverse types of ships by fine-tuning a pre-trained model for the specific task. Experimental results demonstrate the efficacy of the proposed framework in ship classification using optical remote sensing imagery, achieving a high classification accuracy of 94% across 5 classes, outperforming existing methods. This research holds potential applications in maritime surveillance and management, illegal fishing detection, and maritime traffic monitoring.
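CBAM itself is standard and compact; a common PyTorch rendering (reduction ratio 16 and a 7x7 spatial kernel, the usual defaults rather than anything specific to this study) looks like this:

```python
# CBAM: channel attention (shared MLP over avg/max-pooled descriptors)
# followed by spatial attention (conv over channel-wise avg/max maps).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)   # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))          # spatial attention
```

The study's framework incorporates such blocks into ResNet50 and then fine-tunes the network for the ship classes.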
https://arxiv.org/abs/2404.02135
CAM-based methods are widely used post-hoc interpretability methods that produce a saliency map to explain the decision of an image classification model. The saliency map highlights the important areas of the image relevant to the prediction. In this paper, we show that most of these methods can incorrectly attribute an importance score to parts of the image that the model cannot see. We demonstrate this phenomenon both theoretically and experimentally. On the theory side, we analyze the behavior of GradCAM on a simple masked CNN model at initialization. Experimentally, we train a VGG-like model constrained to not use the lower part of the image and nevertheless observe positive scores in the unseen part of the image. This behavior is evaluated quantitatively on two new datasets. We believe that this is problematic, potentially leading to mis-interpretation of the model's behavior.
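The quantitative check is simple to state: for a model barred from seeing the lower half of the image, any saliency mass a CAM method places in that half is spurious. A sketch of the measurement (the CAM computation itself is left to a method of choice):

```python
# Fraction of saliency mass attributed to the (unseen) lower half of an image.
import torch

def unseen_region_mass(cam: torch.Tensor) -> float:
    """cam: (H, W) non-negative saliency map, e.g. from a GradCAM variant."""
    h = cam.shape[0]
    lower = cam[h // 2:, :].sum()
    return (lower / cam.sum().clamp_min(1e-8)).item()
```

A value materially above zero on the constrained VGG-like model is exactly the mis-attribution the paper reports.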
https://arxiv.org/abs/2404.01964
Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking novel potentials across various tasks and applications. Among these domains, the video domain has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel in the video highlight detection task by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our innovative saliency pooling technique, we have achieved, to the best of our knowledge, state-of-the-art performance on the QVHighlight benchmark for highlight detection.
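The abstract does not define saliency pooling; one plausible reading, offered purely as an assumption, is to pool clip features weighted by their predicted saliency scores rather than uniformly:

```python
# Saliency-weighted pooling of per-clip embeddings into one video feature.
import torch

def saliency_pool(clip_feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
    """clip_feats: (T, D) per-clip embeddings; saliency: (T,) predicted scores."""
    w = torch.softmax(saliency, dim=0)
    return (w[:, None] * clip_feats).sum(dim=0)  # (D,)
```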
https://arxiv.org/abs/2404.01745
Remote sensing target detection aims to identify and locate critical targets within remote sensing images, finding extensive applications in agriculture and urban planning. Feature pyramid networks (FPNs) are commonly used to extract multi-scale features. However, existing FPNs often overlook extracting low-level positional information and fine-grained context interaction. To address this, we propose a novel location refined feature pyramid network (LR-FPN) to enhance the extraction of shallow positional information and facilitate fine-grained context interaction. The LR-FPN consists of two primary modules: the shallow position information extraction module (SPIEM) and the contextual interaction module (CIM). Specifically, SPIEM first maximizes the retention of solid location information of the target by simultaneously extracting positional and saliency information from the low-level feature map. Subsequently, CIM injects this robust location information into different layers of the original FPN through spatial and channel interaction, explicitly enhancing the object area. Moreover, in spatial interaction, we introduce a simple local and non-local interaction strategy to learn and retain the saliency information of the object. Lastly, the LR-FPN can be readily integrated into common object detection frameworks to improve performance significantly. Extensive experiments on two large-scale remote sensing datasets (i.e., DOTAV1.0 and HRSC2016) demonstrate that the proposed LR-FPN is superior to state-of-the-art object detection approaches. Our code and models will be publicly available.
https://arxiv.org/abs/2404.01614
The rapid adoption of Electronic Health Records (EHRs) has been instrumental in streamlining administrative tasks, increasing transparency, and enabling continuity of care across providers. An unintended consequence of the increased documentation burden, however, has been reduced face-time with patients and, concomitantly, a dramatic rise in clinician burnout. In this thesis, we pinpoint a particularly time-intensive, yet critical, documentation task: generating a summary of a patient's hospital admissions, and propose and evaluate automated solutions. In Chapter 2, we construct a dataset based on 109,000 hospitalizations (2M source notes) and perform exploratory analyses to motivate future work on modeling and evaluation [NAACL 2021]. In Chapter 3, we address faithfulness from a modeling perspective by revising noisy references [EMNLP 2022] and, to reduce the reliance on references, directly calibrating model outputs to metrics [ACL 2023]. These works relied heavily on automatic metrics as human annotations were limited. To fill this gap, in Chapter 4, we conduct a fine-grained expert annotation of system errors in order to meta-evaluate existing metrics and better understand task-specific issues of domain adaptation and source-summary alignments. To learn a metric less correlated to extractiveness (copy-and-paste), we derive noisy faithfulness labels from an ensemble of existing metrics and train a faithfulness classifier on these pseudo labels [MLHC 2023]. Finally, in Chapter 5, we demonstrate that fine-tuned LLMs (Mistral and Zephyr) are highly prone to entity hallucinations and cover fewer salient entities. We improve both coverage and faithfulness by performing sentence-level entity planning based on a set of pre-computed salient entities from the source text, which extends our work on entity-guided news summarization [ACL, 2023], [EMNLP, 2023].
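Chapter 5's planning step reduces to a pipeline of the following shape, with all three helpers as placeholders for the thesis's actual components:

```python
# Sentence-level entity planning: pre-compute salient entities, assign a
# subset to each summary sentence, and generate sentence by sentence.
def plan_and_generate(source: str, extract_salient_entities, plan_sentences, generate) -> str:
    entities = extract_salient_entities(source)  # salient entities from the source text
    plan = plan_sentences(entities)              # one entity subset per summary sentence
    return " ".join(generate(source, subset) for subset in plan)
```

Conditioning each sentence on a pre-computed entity subset is what the chapter credits for the joint gains in coverage and faithfulness.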
https://arxiv.org/abs/2404.01189
Temporal video grounding (TVG) is a critical task in video content understanding. Despite significant advancements, existing methods are often limited in capturing the fine-grained relationships between multimodal inputs and incur high computational costs when processing long video sequences. To address these limitations, we introduce SpikeMba, a novel multi-modal spiking saliency Mamba for temporal video grounding. In our work, we integrate Spiking Neural Networks (SNNs) and state space models (SSMs) to capture the fine-grained relationships of multimodal features effectively. Specifically, we introduce relevant slots to enhance the model's memory capabilities, enabling a deeper contextual understanding of video sequences. The contextual moment reasoner leverages these slots to maintain a balance between contextual information preservation and semantic relevance exploration. Simultaneously, the spiking saliency detector capitalizes on the unique properties of SNNs to accurately locate salient proposals. Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods across mainstream benchmarks.
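As background for the spiking component, a toy leaky integrate-and-fire (LIF) unit, the basic SNN building block, can be sketched as follows; parameters are illustrative and unrelated to SpikeMba's configuration.

```python
# LIF dynamics: the membrane potential integrates inputs with leak `decay`
# and emits a spike (then resets) when it crosses `threshold`.
import torch

def lif_forward(inputs: torch.Tensor, decay: float = 0.9,
                threshold: float = 1.0) -> torch.Tensor:
    """inputs: (T, N) input currents over T timesteps; returns (T, N) spikes."""
    v = torch.zeros(inputs.shape[1])
    spikes = []
    for t in range(inputs.shape[0]):
        v = decay * v + inputs[t]
        s = (v >= threshold).float()
        v = v * (1 - s)  # reset where a spike fired
        spikes.append(s)
    return torch.stack(spikes)
```

The sparse, threshold-gated firing is the kind of property the spiking saliency detector exploits to flag salient proposals.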
https://arxiv.org/abs/2404.01174
This paper presents a fresh perspective on the role of saliency maps in weakly-supervised semantic segmentation (WSSS) and offers new insights and research directions based on our empirical findings. We conduct comprehensive experiments and observe that the quality of the saliency map is a critical factor in saliency-guided WSSS approaches. Nonetheless, we find that the saliency maps used in previous works are often arbitrarily chosen, despite their significant impact on WSSS. Additionally, we observe that the choice of the threshold, which has received less attention before, is non-trivial in WSSS. To facilitate more meaningful and rigorous research for saliency-guided WSSS, we introduce \texttt{WSSS-BED}, a standardized framework for conducting research under unified conditions. \texttt{WSSS-BED} provides various saliency maps and activation maps for seven WSSS methods, as well as saliency maps from unsupervised salient object detection models.
https://arxiv.org/abs/2404.00918