Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
https://arxiv.org/abs/2410.02741
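The keyphrase-in-prompt recipe above can be sketched end to end. This is a minimal, hypothetical illustration: `extract_keyphrases` is a naive frequency-based stand-in for the finetuned SigExt extractor, and the prompt template is invented, not the paper's. The parameter `k` is where the precision-recall trade-off described in the abstract would be controlled.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "for", "on", "that", "with"}

def extract_keyphrases(document: str, k: int = 5) -> list[str]:
    """Naive stand-in for a learned keyphrase extractor: rank bigrams of
    non-stopword tokens by frequency. SigExt itself is a finetuned model."""
    tokens = [t for t in re.findall(r"[a-z]+", document.lower()) if t not in STOPWORDS]
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [" ".join(bg) for bg, _ in bigrams.most_common(k)]

def build_prompt(document: str, k: int = 5) -> str:
    """Insert the top-k keyphrases into a (hypothetical) summarization
    prompt; more keyphrases pushes the summary toward higher recall."""
    phrases = ", ".join(extract_keyphrases(document, k))
    return (
        "Summarize the following document.\n"
        f"Make sure the summary covers these keyphrases: {phrases}\n\n"
        f"Document:\n{document}"
    )

doc = ("Solar panels convert sunlight into electricity. Solar panels "
       "degrade slowly, and most solar panels last decades.")
prompt = build_prompt(doc, k=2)
```

The resulting `prompt` string would be sent unchanged to any open-weight or proprietary LLM, which is why no LLM customization is needed.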
Unsupervised Domain Adaptation (UDA) is crucial for reducing the need for extensive manual data annotation when training deep networks on point cloud data. A significant challenge of UDA lies in effectively bridging the domain gap. To tackle this challenge, we propose Curvature Diversity-Driven Nuclear-Norm Wasserstein Domain Alignment (CDND). Our approach first introduces a Curvature Diversity-driven Deformation Reconstruction (CurvRec) task, which effectively mitigates the gap between the source and target domains by enabling the model to extract salient features from semantically rich regions of a given point cloud. We then propose Deformation-based Nuclear-norm Wasserstein Discrepancy (D-NWD), which applies the Nuclear-norm Wasserstein Discrepancy to both deformed and original data samples to align the source and target domains. Furthermore, we contribute a theoretical justification for the effectiveness of D-NWD in distribution alignment and demonstrate that it is generic enough to be applied to any deformations. To validate our method, we conduct extensive experiments on two public domain adaptation datasets for point cloud classification and segmentation tasks. Empirical results show that our CDND achieves state-of-the-art performance by a noticeable margin over existing approaches.
https://arxiv.org/abs/2410.02720
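The nuclear-norm ingredient of D-NWD can be illustrated under simplifying assumptions. Here the discrepancy is just the nuclear norm of the gap between mean-centered source and target feature batches — a deliberately crude surrogate, not the authors' exact formulation — and, as the abstract describes, the same discrepancy is applied to both original and deformed samples. The `deform` function is a hypothetical placeholder for any point-cloud deformation.

```python
import numpy as np

def nuclear_norm(m: np.ndarray) -> float:
    """Nuclear norm: the sum of a matrix's singular values."""
    return float(np.linalg.svd(m, compute_uv=False).sum())

def discrepancy(src_feat: np.ndarray, tgt_feat: np.ndarray) -> float:
    """Simplified surrogate: nuclear norm of the gap between mean-centered
    source and target batch feature matrices (same batch size assumed)."""
    gap = (src_feat - src_feat.mean(0)) - (tgt_feat - tgt_feat.mean(0))
    return nuclear_norm(gap)

def d_nwd(src, tgt, deform):
    """Per the abstract, apply the discrepancy to both the original and
    the deformed samples and combine the two terms."""
    return discrepancy(src, tgt) + discrepancy(deform(src), deform(tgt))

rng = np.random.default_rng(0)
s, t = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))

def deform(x):
    """Toy stand-in for a CurvRec-style deformation: small jitter."""
    return x + 0.01 * rng.normal(size=x.shape)

loss = d_nwd(s, t, deform)
```

In training, this scalar would be minimized alongside the task loss so that source and target feature distributions are pulled together.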
Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model, conditioned on the skeleton modality, to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications on construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset facilitates the training and evaluation of machine learning models that instruct autonomous construction machines to perform necessary tasks in real-world construction zones. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D120, and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The codes and dataset are available at: this https URL
https://arxiv.org/abs/2410.01962
Existing approaches for video moment retrieval and highlight detection cannot align text and video features efficiently, resulting in unsatisfactory performance and limited production use. To address this, we propose a novel architecture that utilizes recent foundational video models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly enhances performance in both moment retrieval and highlight detection tasks. To improve results further, we developed InterVid-MR, a large-scale, high-quality dataset for pretraining. Using it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA, and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.
https://arxiv.org/abs/2410.01615
Despite the growing use of deep neural networks in safety-critical decision-making, their inherent black-box nature hinders transparency and interpretability. Explainable AI (XAI) methods have thus emerged to shed light on a model's internal workings, notably attribution methods, also called saliency maps. Conventional attribution methods typically identify the locations -- the where -- of significant regions within an input. However, because they overlook the inherent structure of the input data, these methods often fail to interpret what these regions represent in terms of structural components (e.g., textures in images or transients in sounds). Furthermore, existing methods are usually tailored to a single data modality, limiting their generalizability. In this paper, we propose leveraging the wavelet domain as a robust mathematical foundation for attribution. Our approach, the Wavelet Attribution Method (WAM), extends existing gradient-based feature attributions into the wavelet domain, providing a unified framework for explaining classifiers across images, audio, and 3D shapes. Empirical evaluations demonstrate that WAM matches or surpasses state-of-the-art methods across faithfulness metrics and models in image, audio, and 3D explainability. Finally, we show how our method explains not only the where -- the important parts of the input -- but also the what -- the relevant patterns in terms of structural components.
https://arxiv.org/abs/2410.01482
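The wavelet-domain framing can be made concrete with a hand-rolled one-level Haar transform and a gradient-x-input score per band. This is a toy illustration, not WAM's actual estimator: the "model gradient" below is the sign pattern of a toy coefficient-magnitude model, and real uses would involve richer wavelet families, multiple scales, and autograd. The point it shows is that a structural component (here, a vertical edge) lands in one specific band, so a wavelet-domain attribution names the what, not just the where.

```python
import numpy as np

def haar2d(x):
    """One-level 2D Haar transform: approximation plus horizontal,
    vertical, and diagonal detail bands (input sides must be even)."""
    x00, x01 = x[0::2, 0::2], x[0::2, 1::2]
    x10, x11 = x[1::2, 0::2], x[1::2, 1::2]
    a = (x00 + x01 + x10 + x11) / 2
    h = (x00 - x01 + x10 - x11) / 2   # horizontal detail: responds to vertical edges
    v = (x00 + x01 - x10 - x11) / 2   # vertical detail: responds to horizontal edges
    d = (x00 - x01 - x10 + x11) / 2   # diagonal detail
    return a, h, v, d

def gradient_x_input(bands, grads):
    """Gradient-x-input attribution applied per wavelet band: a band is
    salient where both its coefficient and the model gradient are large."""
    return [g * b for g, b in zip(grads, bands)]

# Toy input: a vertical edge; its energy lands in the horizontal-detail band.
img = np.zeros((8, 8))
img[:, 3:] = 1.0
a, h, v, d = haar2d(img)

# Toy "model" = sum of coefficient magnitudes, whose gradient is the sign.
grads = [np.sign(b) for b in (a, h, v, d)]
attr = gradient_x_input((a, h, v, d), grads)
```

Because the four bands tile the input losslessly (the transform is invertible), attribution mass assigned to a band is attribution assigned to a structural component of the signal.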
The exploration of language skills in language models (LMs) has always been one of the central goals in mechanistic interpretability. However, existing circuit analyses often fall short of representing the full functional scope of these models, primarily due to the exclusion of feed-forward layers. Additionally, isolating the effect of a single language skill from a text, which inherently involves multiple entangled skills, poses a significant challenge. To address these gaps, we introduce a novel concept, the Memory Circuit: a minimal unit that fully and independently manipulates the memory-reading functionality of a language model. We disentangle the transformer model precisely into a circuit graph, an ensemble of paths connecting different memory circuits. Based on this disentanglement, we identify salient circuit paths, termed skill paths, responsible for three crucial language skills -- the Previous Token Skill, the Induction Skill, and the In-Context Learning (ICL) Skill -- leveraging causal-effect estimation through interventions and counterfactuals. Our experiments on various datasets confirm the correspondence between our identified skill paths and language skills, and validate three longstanding hypotheses: 1) language skills are identifiable through circuit dissection; 2) simple language skills reside in shallow layers, whereas complex language skills are found in deeper layers; 3) complex language skills are formed on top of simpler language skills. Our codes are available at: this https URL.
https://arxiv.org/abs/2410.01334
Deep learning models often function as black boxes, providing no straightforward reasoning for their predictions. This is particularly true for computer vision models, which process tensors of pixel values to generate outcomes in tasks such as image classification and object detection. To elucidate the reasoning of these models, class activation maps (CAMs) are used to highlight salient regions that influence a model's output. This research introduces KPCA-CAM, a technique designed to enhance the interpretability of Convolutional Neural Networks (CNNs) through improved class activation maps. KPCA-CAM leverages Principal Component Analysis (PCA) with the kernel trick to capture nonlinear relationships within CNN activations more effectively. By mapping data into higher-dimensional spaces with kernel functions and extracting principal components from this transformed hyperplane, KPCA-CAM provides more accurate representations of the underlying data manifold. This enables a deeper understanding of the features influencing CNN decisions. Empirical evaluations on the ILSVRC dataset across different CNN models demonstrate that KPCA-CAM produces more precise activation maps, providing clearer insights into the model's reasoning compared to existing CAM algorithms. This research advances CAM techniques, equipping researchers and practitioners with a powerful tool to gain deeper insights into CNN decision-making processes and overall behaviors.
https://arxiv.org/abs/2410.00267
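A rough sketch of the kernel-PCA step described above, under the assumption that the map is built from the leading kernel principal component of the spatial activation vectors; the published KPCA-CAM may differ in how it selects components, chooses the kernel, and normalizes. The RBF kernel plus double-centering is the standard kernel-PCA machinery the abstract refers to.

```python
import numpy as np

def kpca_cam(activations: np.ndarray, gamma: float = 0.1) -> np.ndarray:
    """Hedged sketch of a kernel-PCA activation map: treat each spatial
    position of a (C, H, W) activation tensor as a C-dim point, run RBF
    kernel PCA over those points, and use the leading component as the map."""
    c, hgt, wid = activations.shape
    pts = activations.reshape(c, -1).T              # (H*W, C) points
    sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    k = np.exp(-gamma * sq)                         # RBF kernel matrix
    n = k.shape[0]
    one = np.full((n, n), 1.0 / n)
    kc = k - one @ k - k @ one + one @ k @ one      # double-center the kernel
    vals, vecs = np.linalg.eigh(kc)                 # eigh sorts ascending
    cam = np.abs(vecs[:, -1]).reshape(hgt, wid)     # leading principal component
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

act = np.zeros((3, 4, 4))
act[:, 1:3, 1:3] = 1.0      # toy "object" region lights up in all channels
cam = kpca_cam(act)
```

On this toy input the activation vectors fall into two clusters (object vs. background), and the leading kernel component cleanly separates them, so the normalized map peaks on the object region.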
Identifying and classifying shutdown initiating events (SDIEs) is critical for developing low power shutdown probabilistic risk assessments for nuclear power plants. Existing computational approaches cannot achieve satisfactory performance due to the challenges of unavailable large, labeled datasets, imbalanced event types, and label noise. To address these challenges, we propose a hybrid pipeline that integrates a knowledge-informed machine learning model to prescreen non-SDIEs and a large language model (LLM) to classify SDIEs into four types. In the prescreening stage, we propose a set of 44 SDIE text patterns that consist of the most salient keywords and phrases from six SDIE types. Text vectorization based on the SDIE patterns generates feature vectors that are highly separable using a simple binary classifier. The second stage builds a Bidirectional Encoder Representations from Transformers (BERT)-based LLM, which learns generic English language representations from self-supervised pretraining on a large dataset and adapts to SDIE classification by fine-tuning on an SDIE dataset. The proposed approaches are evaluated on a dataset with 10,928 events using precision, recall, F1 score, and average accuracy. The results demonstrate that the prescreening stage can exclude more than 97% of non-SDIEs, and the LLM achieves an average accuracy of 93.4% for SDIE classification.
https://arxiv.org/abs/2410.00929
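The prescreening stage might look like the following sketch. The four patterns are invented placeholders (the paper defines 44 drawn from six SDIE types), and a fixed any-match rule stands in for the trained binary classifier the paper fits on the pattern-based feature vectors.

```python
import re

# Hypothetical stand-ins for the paper's 44 SDIE text patterns.
SDIE_PATTERNS = [
    r"reactor (trip|scram)",
    r"loss of (offsite|shutdown) cooling",
    r"manual shutdown",
    r"rcs (drain|draindown)",
]

def vectorize(event_text: str) -> list[int]:
    """Binary feature vector: one indicator per salient text pattern."""
    text = event_text.lower()
    return [1 if re.search(p, text) else 0 for p in SDIE_PATTERNS]

def prescreen(event_text: str) -> bool:
    """Stage-1 sketch: flag the event as a candidate SDIE when any pattern
    fires (the paper instead trains a simple binary classifier on these
    highly separable vectors)."""
    return any(vectorize(event_text))

candidate = prescreen("Operators initiated a manual shutdown after the alarm.")
rejected = prescreen("Routine maintenance completed on schedule.")
```

Events that survive this cheap filter would then be passed to the fine-tuned BERT-based classifier for four-way SDIE typing.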
Large Language Models (LLMs) have succeeded considerably in In-Context-Learning (ICL) based summarization. However, saliency is subject to the users' specific preference histories. Hence, we need reliable In-Context Personalization Learning (ICPL) capabilities within such LLMs. For any arbitrary LLM to exhibit ICPL, it needs to have the ability to discern contrast in user profiles. A recent study proposed a measure for degree-of-personalization called EGISES for the first time. EGISES measures a model's responsiveness to user profile differences. However, it cannot test if a model utilizes all three types of cues provided in ICPL prompts: (i) example summaries, (ii) user's reading histories, and (iii) contrast in user profiles. To address this, we propose the iCOPERNICUS framework, a novel In-COntext PERsonalization learNIng sCrUtiny of Summarization capability in LLMs that uses EGISES as a comparative measure. As a case-study, we evaluate 17 state-of-the-art LLMs based on their reported ICL performances and observe that 15 models' ICPL degrades (min: 1.6%; max: 3.6%) when probed with richer prompts, thereby showing lack of true ICPL.
https://arxiv.org/abs/2410.00149
Convolutional neural networks (CNNs) achieve leading results in segmentation tasks and represent the state of the art for image-based analysis. However, the precise decision-making process of a CNN remains largely opaque. The research area of explainable artificial intelligence (xAI) primarily revolves around understanding and interpreting this black-box behavior. One way of interpreting a CNN is the use of class activation maps (CAMs), heatmaps that indicate the importance of image areas for the CNN's prediction. For classification tasks, a variety of CAM algorithms exist. But for segmentation tasks, only one CAM algorithm for interpreting the output of a CNN exists. We propose a transfer between existing classification- and segmentation-based methods for more detailed, explainable, and consistent results which show salient pixels in semantic segmentation tasks. The resulting Seg-HiRes-Grad CAM is an extension of the segmentation-based Seg-Grad CAM with the transfer to the classification-based HiRes CAM. Our method improves the existing segmentation-based method by adapting it to recently published classification-based methods. Especially for medical image segmentation, this transfer resolves existing explainability disadvantages.
https://arxiv.org/abs/2409.20287
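The classification-to-segmentation transfer can be sketched by contrasting HiRes-CAM-style element-wise gradient x activation with Grad-CAM's spatially averaged channel weights. This is a simplified sketch, not the paper's exact method: `grads` is assumed to come from backpropagating the summed target-class score over its predicted region, which is not computed here.

```python
import numpy as np

def seg_hires_grad_cam(activations: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """HiRes-CAM-style map: element-wise gradient x activation (no spatial
    averaging of gradients), summed over channels and clipped at zero."""
    return np.maximum((grads * activations).sum(axis=0), 0.0)

def grad_cam(activations: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """Classic Grad-CAM for comparison: spatially averaged gradients act
    as channel weights, which can smear localized evidence."""
    weights = grads.mean(axis=(1, 2), keepdims=True)
    return np.maximum((weights * activations).sum(axis=0), 0.0)

# Toy case: channel 0 fires at two locations, but only the first one
# actually contributes gradient to the target-class score.
acts = np.zeros((2, 4, 4))
acts[0, 0, 0] = 1.0
acts[0, 3, 3] = 1.0
grads = np.zeros_like(acts)
grads[0, 0, 0] = 1.0

hires = seg_hires_grad_cam(acts, grads)
smeared = grad_cam(acts, grads)
```

The element-wise map highlights only the location that carried gradient, while Grad-CAM's averaged channel weight lights up both blobs equally; that extra spatial fidelity is what the HiRes transfer brings to segmentation explanations.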
Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in actions that we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art performance on several visual reasoning tasks being more sample-efficient, and generalizing better to out-of-distribution visual inputs than prior models.
https://arxiv.org/abs/2409.20213
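A sketch of the glimpsing loop under simple assumptions: glimpses are chosen greedily from a precomputed saliency map with local suppression so they do not pile up, and each high-resolution patch is paired with its low-dimensional location — the signal the abstract hypothesizes carries the relational information. The GAP system itself learns where to glimpse; the greedy rule here is only illustrative.

```python
import numpy as np

def select_glimpses(saliency: np.ndarray, num: int, size: int = 3):
    """Greedy glimpse selection: repeatedly take the most salient pixel,
    then suppress its neighborhood (non-maximum-suppression style)."""
    sal = saliency.copy()
    half = size // 2
    locs = []
    for _ in range(num):
        r, c = np.unravel_index(np.argmax(sal), sal.shape)
        locs.append((int(r), int(c)))
        sal[max(0, r - half):r + half + 1, max(0, c - half):c + half + 1] = -np.inf
    return locs

def glimpse_features(image: np.ndarray, locs, size: int = 3):
    """Pair each high-resolution patch with its (row, col) location; the
    locations are the low-dimensional spatial signal used for relations."""
    half = size // 2
    return [((r, c), image[max(0, r - half):r + half + 1,
                           max(0, c - half):c + half + 1])
            for r, c in locs]

sal = np.zeros((8, 8))
sal[1, 1], sal[6, 6] = 2.0, 1.0      # two salient spots, one stronger
locs = select_glimpses(sal, num=2)
img = np.arange(64, dtype=float).reshape(8, 8) / 64.0
glimpses = glimpse_features(img, locs)
```

Downstream, a relational model would consume the `(location, patch)` pairs jointly, so that "same vs. different" judgments can rely on where the glimpses landed and not only on pixel content.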
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks. Approximate MU is a practical method for large-scale models. Our investigation into approximate MU starts with identifying the steepest descent direction, minimizing the output Kullback-Leibler divergence to exact MU inside a parameters' neighborhood. This probed direction decomposes into three components: weighted forgetting gradient ascent, fine-tuning retaining gradient descent, and a weight saliency matrix. This decomposition, derived from the Euclidean metric, encompasses most existing gradient-based MU methods. Nevertheless, adhering to Euclidean space may result in sub-optimal iterative trajectories because it overlooks the geometric structure of the output probability space. We suggest embedding the unlearning update into a manifold rendered by the remaining geometry, incorporating second-order Hessian information from the remaining data. This helps prevent effective unlearning from interfering with the retained performance. However, computing the second-order Hessian for large-scale models is intractable. To efficiently leverage the benefits of Hessian modulation, we propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction. Free from specific modal constraints, our approach is adaptable across computer vision unlearning tasks, including classification and generation. Extensive experiments validate our efficacy and efficiency. Notably, our method successfully performs class-forgetting on ImageNet using DiT and forgets a class on CIFAR-10 using DDPM in just 50 steps, compared to the thousands of steps required by previous methods.
https://arxiv.org/abs/2409.19732
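The three-component direction and the fast-slow strategy can be sketched on toy vectors. All quantities here are illustrative: the losses are quadratics, the saliency matrix is a 0/1 vector, and the exponential moving average is a deliberately crude stand-in for the Hessian-modulated slow update the paper approximates.

```python
import numpy as np

def unlearning_step(theta, grad_forget, grad_retain, saliency, lr=0.1, alpha=1.0):
    """One step of the decomposed direction from the abstract: weighted
    gradient ascent on the forget loss plus descent on the retain loss,
    masked by a weight-saliency matrix (here a 0/1 vector)."""
    direction = alpha * grad_forget - grad_retain   # ascend forget, descend retain
    return theta + lr * saliency * direction

def fast_slow_update(theta_fast, theta_slow, beta=0.9):
    """Fast-slow sketch: slow weights track fast weights with an EMA, a
    cheap stand-in for the implicitly Hessian-modulated slow update."""
    return beta * theta_slow + (1 - beta) * theta_fast

theta = np.zeros(3)
forget_target = np.array([1.0, 0.0, 0.0])   # minimizer of the forget loss
retain_target = np.array([0.0, 1.0, 0.0])   # minimizer of the retain loss
saliency = np.array([1.0, 1.0, 0.0])        # third weight deemed non-salient

g_f = theta - forget_target                 # gradient of 0.5*||theta - f||^2
g_r = theta - retain_target
theta_fast = unlearning_step(theta, g_f, g_r, saliency)
theta_slow = fast_slow_update(theta_fast, np.zeros(3))
```

After one step the parameters move away from the forget-loss minimizer and toward the retain-loss minimizer, while the masked coordinate stays untouched — exactly the division of labor the decomposition describes.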
Brain CT report generation is significant to aid physicians in diagnosing cranial diseases. Recent studies concentrate on handling the consistency between visual and textual pathological features to improve the coherence of report. However, there exist some challenges: 1) Redundant visual representing: Massive irrelevant areas in 3D scans distract models from representing salient visual contexts. 2) Shifted semantic representing: Limited medical corpus causes difficulties for models to transfer the learned textual representations to generative layers. This study introduces a Pathological Clue-driven Representation Learning (PCRL) model to build cross-modal representations based on pathological clues and naturally adapt them for accurate report generation. Specifically, we construct pathological clues from perspectives of segmented regions, pathological entities, and report themes, to fully grasp visual pathological patterns and learn cross-modal feature representations. To adapt the representations for the text generation task, we bridge the gap between representation learning and report generation by using a unified large language model (LLM) with task-tailored instructions. These crafted instructions enable the LLM to be flexibly fine-tuned across tasks and smoothly transfer the semantic representation for report generation. Experiments demonstrate that our method outperforms previous methods and achieves SoTA performance. Our code is available at this https URL.
https://arxiv.org/abs/2409.19676
The well-worn George Box aphorism "all models are wrong, but some are useful" is particularly salient in the cybersecurity domain, where the assumptions built into a model can have substantial financial or even national security impacts. Computer scientists are often asked to optimize for worst-case outcomes, and since security is largely focused on risk mitigation, preparing for the worst-case scenario appears rational. In this work, we demonstrate that preparing for the worst case rather than the most probable case may yield suboptimal outcomes for learning agents. Through the lens of stochastic Bayesian games, we first explore different attacker knowledge modeling assumptions that impact the usefulness of models to cybersecurity practitioners. By considering different models of attacker knowledge about the state of the game and a defender's hidden information, we find that there is a cost to the defender for optimizing against the worst case.
https://arxiv.org/abs/2409.19237
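The defender's "cost of the worst case" can be shown with a two-strategy, two-attacker-type toy game. The payoff numbers and the threat prior are invented for illustration; the paper's stochastic Bayesian games are far richer, but the arithmetic of maximin vs. optimizing against the probable attacker is the same.

```python
import numpy as np

# Rows: defender strategies; columns: attacker types (e.g. a scripted bot
# vs. a rare informed adversary). Entries are defender payoffs (invented).
payoff = np.array([
    [5.0, 1.0],   # hardened config: acceptable against both, never great
    [9.0, 0.0],   # usability-first config: great vs the common bot,
])                # terrible vs the informed attacker

prior = np.array([0.9, 0.1])          # bot attacks dominate in practice

worst_case = payoff.min(axis=1)       # guaranteed payoff per strategy
maximin_choice = int(worst_case.argmax())

expected = payoff @ prior             # payoff under the threat prior
bayes_choice = int(expected.argmax())

# Cost of worst-case planning: expected payoff the maximin defender
# forgoes relative to optimizing against the probable attacker.
regret = expected[bayes_choice] - expected[maximin_choice]
```

Here the maximin defender picks the hardened config (guaranteed 1.0), but under the realistic threat prior it earns 4.6 in expectation versus 8.1 for the Bayes-optimal choice, a 3.5-point cost for optimizing against the worst case.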
While learning-based compression techniques for images have outperformed traditional methods, they have not been widely adopted in machine learning pipelines. This is largely due to a lack of standardization and a failure to retain the salient features needed for such tasks. Decompression of images has taken a back seat in recent years, while the focus has shifted to an image's utility for performing machine-learning-based analysis on top of it. Thus the demand for compression pipelines that preserve such features from images has become ever present. The methods outlined in this report build on recent work on learning-based image compression techniques to incorporate downstream tasks into them. We propose various methods of finetuning and enhancing different parts of a pretrained compression encoding pipeline, and present the results of our investigation into the performance of vision tasks using compression-based pipelines.
https://arxiv.org/abs/2409.19184
We study how differences in persuasive language across Wikipedia articles, written in either English or Russian, can uncover each culture's distinct perspective on different subjects. We develop a large language model (LLM) powered system to identify instances of persuasive language in multilingual texts. Instead of directly prompting LLMs to detect persuasion, which is subjective and difficult, we propose to reframe the task to instead ask high-level questions (HLQs) which capture different persuasive aspects. Importantly, these HLQs are authored by LLMs themselves. LLMs over-generate a large set of HLQs, which are subsequently filtered to a small set aligned with human labels for the original task. We then apply our approach to a large-scale, bilingual dataset of Wikipedia articles (88K total), using a two-stage identify-then-extract prompting strategy to find instances of persuasion. We quantify the amount of persuasion per article, and explore the differences in persuasion through several experiments on the paired articles. Notably, we generate rankings of articles by persuasion in both languages. These rankings match our intuitions on culturally salient subjects: Russian Wikipedia highlights subjects on Ukraine, while English Wikipedia highlights the Middle East. Grouping subjects into larger topics, we find that politically related events contain more persuasion than others. We further demonstrate that HLQs obtain similar performance when posed in either English or Russian. Our methodology enables cross-lingual, cross-cultural understanding at scale, and we release our code, prompts, and data.
https://arxiv.org/abs/2409.19148
There is a mismatch between psychological and computational studies on emotions. Psychological research aims at explaining and documenting internal mechanisms of these phenomena, while computational work often simplifies them into labels. Many emotion fundamentals remain under-explored in natural language processing, particularly how emotions develop and how people cope with them. To help reduce this gap, we follow theories on coping, and treat emotions as strategies to cope with salient situations (i.e., how people deal with emotion-eliciting events). This approach allows us to investigate the link between emotions and behavior, which also emerges in language. We introduce the task of coping identification, together with a corpus to do so, constructed via role-playing. We find that coping strategies realize in text even though they are challenging to recognize, both for humans and automatic systems trained and prompted on the same task. We thus open up a promising research direction to enhance the capability of models to better capture emotion mechanisms from text.
https://arxiv.org/abs/2409.19025
Despite transformers being considered the new standard in computer vision, convolutional neural networks (CNNs) still outperform them in low-data regimes. Nonetheless, CNNs often make decisions based on narrow, specific regions of input images, especially when training data is limited. This behavior can severely compromise the model's generalization capabilities, making it disproportionately dependent on certain features that might not represent the broader context of images. While the conditions leading to this phenomenon remain elusive, the primary intent of this article is to shed light on this observed behavior of neural networks. Our research aims to provide comprehensive insight into this phenomenon and to outline an initial response to it. In line with this, we introduce Saliency Guided Dropout (SGDrop), a pioneering regularization approach tailored to address this specific issue. SGDrop utilizes attribution methods on the feature map to identify and then reduce the influence of the most salient features during training. This process encourages the network to diversify its attention and not focus solely on specific standout areas. Our experiments across several visual classification benchmarks validate SGDrop's role in enhancing generalization. Significantly, models incorporating SGDrop display more expansive attributions and neural activity, offering a more comprehensive view of input images in contrast to their traditionally trained counterparts.
https://arxiv.org/abs/2409.17370
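The core SGDrop idea, masking out the most salient feature-map activations during training, can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions: the function name, the flat top-k masking scheme, and the drop fraction are all assumptions; the paper's actual method operates on framework feature maps with learned attributions.

```python
import numpy as np

def sgdrop(features, attributions, drop_frac=0.1):
    """Zero out the most salient fraction of feature-map activations.

    features, attributions: arrays of identical shape (e.g. C x H x W for
    one sample); attributions are per-activation saliency scores
    (e.g. gradient-based, as in Grad-CAM-style methods).
    """
    flat = attributions.reshape(-1)
    k = max(1, int(drop_frac * flat.size))
    # indices of the k most salient activations
    top = np.argpartition(flat, -k)[-k:]
    mask = np.ones(flat.size)
    mask[top] = 0.0
    # dropping the dominant activations forces the network to spread
    # its attention over less salient regions during training
    return features * mask.reshape(features.shape)
```

At inference time no masking would be applied, mirroring standard dropout; during training the mask is recomputed per batch from fresh attributions.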
The climate crisis is a salient issue in online discussions, and hypocrisy accusations are a central rhetorical element in these debates. However, for large-scale text analysis, hypocrisy accusation detection is an understudied tool, most often defined as a smaller subtask of fallacious argument detection. In this paper, we define hypocrisy accusation detection as an independent task in NLP, and identify different relevant subtypes of hypocrisy accusations. Our Climate Hypocrisy Accusation Corpus (CHAC) consists of 420 Reddit climate debate comments, expert-annotated into two different types of hypocrisy accusations: personal versus political hypocrisy. We evaluate 6-shot in-context learning with 3 instruction-tuned Large Language Models (LLMs) for detecting hypocrisy accusations in this dataset. Results indicate that the GPT-4o and Llama-3 models in particular show promise in detecting hypocrisy accusations (F1 reaching 0.68, while previous work shows F1 of 0.44). However, context matters for a complex semantic concept such as hypocrisy accusations, and we find models struggle especially at identifying political hypocrisy accusations compared to personal moral hypocrisy accusations. Our study contributes new insights in hypocrisy detection and climate change discourse, and is a stepping stone for large-scale analysis of hypocrisy accusations in online climate debates.
https://arxiv.org/abs/2409.16807
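The 6-shot in-context learning setup described above can be sketched as a simple prompt-assembly function. This is an illustrative assumption, not the paper's actual prompt: the function name, label set, and instruction wording are invented for the sketch.

```python
# Hypothetical sketch: assemble a few-shot classification prompt for
# hypocrisy-accusation detection. Labels and wording are assumptions.

def build_prompt(shots, comment):
    """shots: list of (comment_text, label) exemplar pairs, e.g. 6 shots;
    label is one of {'personal', 'political', 'none'}."""
    lines = [
        "Classify the Reddit comment as a personal hypocrisy accusation, "
        "a political hypocrisy accusation, or none.",
        "",
    ]
    for text, label in shots:
        lines += [f"Comment: {text}", f"Label: {label}", ""]
    # the model completes the final Label: field
    lines += [f"Comment: {comment}", "Label:"]
    return "\n".join(lines)
```

The returned string would be sent to an instruction-tuned LLM, whose completion of the final `Label:` field is taken as the prediction.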