The proliferation of applications using artificial intelligence (AI) systems has led to a growing number of users interacting with these systems through sophisticated interfaces. Human-computer interaction research has long shown that interfaces shape both user behavior and user perception of technical capabilities and risks. Yet, practitioners and researchers evaluating the social and ethical risks of AI systems tend to overlook the impact of anthropomorphic, deceptive, and immersive interfaces on human-AI interactions. Here, we argue that design features of interfaces with adaptive AI systems can have cascading impacts, driven by feedback loops, which extend beyond those previously considered. We first conduct a scoping review of AI interface designs and their negative impact to extract salient themes of potentially harmful design patterns in AI interfaces. Then, we propose Design-Enhanced Control of AI systems (DECAI), a conceptual model to structure and facilitate impact assessments of AI interface designs. DECAI draws on principles from control systems theory -- a theory for the analysis and design of dynamic physical systems -- to dissect the role of the interface in human-AI systems. Through two case studies on recommendation systems and conversational language model systems, we show how DECAI can be used to evaluate AI interface designs.
https://arxiv.org/abs/2404.11370
Inquisitive questions -- open-ended, curiosity-driven questions people ask as they read -- are an integral part of discourse processing (Kehler and Rohde, 2017; Onea, 2016) and comprehension (Prince, 2004). Recent work in NLP has taken advantage of question generation capabilities of LLMs to enhance a wide range of applications. But the space of inquisitive questions is vast: many questions can be evoked from a given context. So which of those should be prioritized to find answers? Linguistic theories, unfortunately, have not yet provided an answer to this question. This paper presents QSALIENCE, a salience predictor of inquisitive questions. QSALIENCE is instruction-tuned over our dataset of linguist-annotated salience scores of 1,766 (context, question) pairs. A question scores high on salience if answering it would greatly enhance the understanding of the text (Van Rooy, 2003). We show that highly salient questions are empirically more likely to be answered in the same article, bridging potential questions (Onea, 2016) with Questions Under Discussion (Roberts, 2012). We further validate our findings by showing that answering salient questions is an indicator of summarization quality in news.
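The abstract does not include the model or prompt details; the sketch below only illustrates, under stated assumptions, how an instruction-tuned salience predictor in the spirit of QSALIENCE might be queried: a (context, question) pair is rendered into an instruction prompt and the reply is parsed into a 1-5 salience score. `PROMPT_TEMPLATE`, `query_model`, and the parsing rule are hypothetical placeholders, not the released code.

```python
# Minimal sketch of querying an instruction-tuned salience scorer (assumptions only).
import re

PROMPT_TEMPLATE = (
    "Read the context and rate how salient the question is on a 1-5 scale, "
    "where 5 means answering it would greatly enhance understanding of the text.\n\n"
    "Context: {context}\n\nQuestion: {question}\n\nSalience (1-5):"
)

def parse_salience(reply: str) -> int:
    """Extract the first integer in [1, 5] from the model reply."""
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"No salience score found in reply: {reply!r}")
    return int(match.group())

def score_question(context: str, question: str, query_model) -> int:
    """query_model is a hypothetical stand-in for any LLM call."""
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return parse_salience(query_model(prompt))

if __name__ == "__main__":
    fake_model = lambda prompt: "4 - answering this clarifies the main claim"  # stub call
    print(score_question("The city cut its budget...", "Why was the budget cut?", fake_model))
```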
https://arxiv.org/abs/2404.10917
The aim of this work is to establish how accurately a recent semantic-based foveal active perception model is able to complete visual tasks that are regularly performed by humans, namely, scene exploration and visual search. This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations. It has been used previously in scene exploration tasks. In this paper, we revisit the model and extend its application to visual search tasks. To illustrate the benefits of using semantic information in scene exploration and visual search tasks, we compare its performance against traditional saliency-based models. In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model in accurately representing the semantic information present in the visual scene. In visual search experiments, when searching for instances of a target class in a visual field containing multiple distractors, the semantic-based method again outperforms the saliency-driven model and a random gaze selection algorithm. Our results demonstrate that top-down semantic information significantly influences visual exploration and search, suggesting a promising direction for integrating it with traditional bottom-up cues.
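As a rough illustration of the difference between the two fixation policies compared above, the following minimal sketch (not the authors' model) selects fixations either from a bottom-up saliency map or from a top-down map of remaining semantic uncertainty; the grid size, the uncertainty update, and the inhibition-of-return rule are illustrative assumptions.

```python
import numpy as np

def next_fixation(score_map, visited):
    """Pick the highest-scoring unvisited cell (simple inhibition of return)."""
    masked = np.where(visited, -np.inf, score_map)
    return np.unravel_index(np.argmax(masked), score_map.shape)

rng = np.random.default_rng(0)
saliency = rng.random((8, 8))                    # bottom-up conspicuity map
semantic_uncertainty = np.ones((8, 8))           # top-down: nothing explained yet
visited_bu, visited_td = np.zeros((8, 8), bool), np.zeros((8, 8), bool)

for step in range(5):
    bu = next_fixation(saliency, visited_bu)                # saliency-driven policy
    td = next_fixation(semantic_uncertainty, visited_td)    # semantic-driven policy
    visited_bu[bu], visited_td[td] = True, True
    semantic_uncertainty[td] *= 0.1              # fixating a region largely explains it
    print(f"step {step}: saliency-driven {bu}, semantic-driven {td}")
```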
https://arxiv.org/abs/2404.10836
Generating background scenes for salient objects plays a crucial role across various domains including creative design and e-commerce, as it enhances the presentation and context of subjects by integrating them into tailored environments. Background generation can be framed as a task of text-conditioned outpainting, where the goal is to extend image content beyond a salient object's boundaries on a blank background. Although popular diffusion models for text-guided inpainting can also be used for outpainting by mask inversion, they are trained to fill in missing parts of an image rather than to place an object into a scene. Consequently, when used for background creation, inpainting models frequently extend the salient object's boundaries and thereby change the object's identity, which is a phenomenon we call "object expansion." This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task using Stable Diffusion and ControlNet architectures. We present a series of qualitative and quantitative results across models and datasets, including a newly proposed metric to measure object expansion that does not require any human labeling. Compared to Stable Diffusion 2.0 Inpainting, our proposed approach reduces object expansion by 3.6x on average with no degradation in standard visual metrics across multiple datasets.
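The abstract does not define the expansion metric precisely; one label-free way to quantify object expansion, sketched below under assumptions, is to re-estimate the salient-object mask in the outpainted image (for example, with any off-the-shelf salient object detector) and compare its area to the original mask.

```python
import numpy as np

def object_expansion(original_mask: np.ndarray, outpainted_mask: np.ndarray) -> float:
    """Ratio of object area after outpainting to the original object area (>1 means growth)."""
    original_area = np.count_nonzero(original_mask)
    if original_area == 0:
        raise ValueError("original mask is empty")
    return np.count_nonzero(outpainted_mask) / original_area

# Toy example: a 10x10-pixel object that grows to 14x12 pixels after outpainting.
before = np.zeros((64, 64), dtype=bool); before[20:30, 20:30] = True
after = np.zeros((64, 64), dtype=bool);  after[18:32, 19:31] = True
print(f"expansion: {object_expansion(before, after):.2f}x")
```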
https://arxiv.org/abs/2404.10157
Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains like animated wallpapers require seamless looping, where the first and last frames of the video match. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require the entire video to be input during training so that temporal and positional information can be encoded at once. However, due to limitations in GPU memory, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy that progressively increases the number of frames while reducing the fine-tuning modules. Additionally, we introduce the Temporal Enhanced Motion Module (TEMM) to extend the capacity for encoding temporal and positional information up to 36 frames. The proposed LoopAnimate is, to the best of our knowledge, the first to extend the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
https://arxiv.org/abs/2404.09172
Learning the skill of human bimanual grasping can extend the capabilities of robotic systems when grasping large or heavy objects. However, it requires a much larger search space for grasp points than single-handed grasping, as well as numerous bimanual grasping annotations for network learning, making both data-driven and analytical grasping methods inefficient and insufficient. We propose a framework for bimanual grasp saliency learning that aims to predict the contact points for bimanual grasping based on existing human single-handed grasping data. We learn saliency correspondence vectors from minimal bimanual contact annotations; these vectors establish correspondences between the grasp positions of both hands and eliminate the need for a large-scale bimanual grasp dataset. The existing single-handed grasp saliency value serves as the initial value for bimanual grasp saliency, and we learn a saliency adjustment score that is added to this initial value to obtain the final bimanual grasp saliency, which predicts preferred bimanual grasp positions from single-handed grasp saliency. We also introduce a physics-balance loss function and a physics-aware refinement module that enforce physical grasp balance and enhance generalization to unknown objects. Comprehensive experiments in simulation and comparisons on dexterous grippers have demonstrated that our method can achieve balanced bimanual grasping effectively.
https://arxiv.org/abs/2404.08944
In this paper, we introduce Saliency-Based Adaptive Masking (SBAM), a novel and cost-effective approach that significantly enhances the pre-training performance of Masked Image Modeling (MIM) approaches by prioritizing token salience. Our method provides robustness against variations in masking ratios, effectively mitigating the performance instability issues common in existing methods. This relaxes the sensitivity of MIM-based pre-training to masking ratios, which in turn allows us to propose an adaptive strategy for `tailored' masking ratios for each data sample, which no existing method can provide. Toward this goal, we propose an Adaptive Masking Ratio (AMR) strategy that dynamically adjusts the proportion of masking for the unique content of each image based on token salience. We show that our method significantly improves over the state-of-the-art in mask-based pre-training on the ImageNet-1K dataset.
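A minimal sketch of the adaptive-masking idea, with assumed details (how token saliency is obtained and how the per-sample ratio is bounded are illustrative choices, not the SBAM/AMR implementation): per-token saliency is mapped to a per-sample masking ratio, and masking then prioritizes the most salient tokens.

```python
import torch

def adaptive_mask(saliency: torch.Tensor, base_ratio: float = 0.75,
                  spread: float = 0.15) -> torch.Tensor:
    """saliency: (B, N) nonnegative token scores; returns a boolean mask of shape (B, N)."""
    B, N = saliency.shape
    # Images whose saliency concentrates on few tokens (low entropy) get a lower ratio.
    probs = saliency / saliency.sum(dim=1, keepdim=True).clamp_min(1e-8)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1) / torch.log(torch.tensor(float(N)))
    ratios = (base_ratio + spread * (entropy - 0.5)).clamp(0.4, 0.9)     # per-sample ratios (B,)

    mask = torch.zeros(B, N, dtype=torch.bool)
    order = saliency.argsort(dim=1, descending=True)   # most salient tokens masked first
    for b in range(B):
        k = int(round(ratios[b].item() * N))
        mask[b, order[b, :k]] = True
    return mask

tokens_saliency = torch.rand(2, 196)                   # e.g. 14x14 ViT patch tokens
print(adaptive_mask(tokens_saliency).float().mean(dim=1))  # realized per-sample ratios
```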
https://arxiv.org/abs/2404.08327
As the global population ages, the number of fall-related incidents is on the rise. Effective fall detection systems, specifically in the healthcare sector, are crucial to mitigate the risks associated with such events. This study evaluates the role of visual context, including background objects, in the accuracy of fall detection classifiers. We present a segmentation pipeline to semi-automatically separate individuals and objects in images. Well-established models like ResNet-18, EfficientNetV2-S, and Swin-Small are trained and evaluated. During training, pixel-based transformations are applied to segmented objects, and the models are then evaluated on raw images without segmentation. Our findings highlight the significant influence of visual context on fall detection. The application of Gaussian blur to the image background notably improves the performance and generalization capabilities of all models. Background objects such as beds, chairs, or wheelchairs can challenge fall detection systems, leading to false positive alarms. However, we demonstrate that object-specific contextual transformations during training effectively mitigate this challenge. Further analysis using saliency maps supports our observation that visual context is crucial in classification tasks. We make both the dataset processing API and the segmentation pipeline available at this https URL.
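A small sketch of the background transformation described above (illustrative, not the paper's pipeline): given a segmentation mask for the person, blur only the background pixels and keep the foreground sharp; the blur strength is an assumed parameter.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_background(image: np.ndarray, foreground_mask: np.ndarray,
                    sigma: float = 5.0) -> np.ndarray:
    """image: (H, W, 3) float array; foreground_mask: (H, W) bool; returns the augmented image."""
    blurred = np.stack([gaussian_filter(image[..., c], sigma) for c in range(3)], axis=-1)
    mask = foreground_mask[..., None].astype(image.dtype)
    return mask * image + (1.0 - mask) * blurred            # sharp person, blurred background

rng = np.random.default_rng(0)
frame = rng.random((120, 160, 3))                           # stand-in for a camera frame
person = np.zeros((120, 160), dtype=bool); person[30:100, 60:100] = True
augmented = blur_background(frame, person)
```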
https://arxiv.org/abs/2404.08088
The field of eXplainable artificial intelligence (XAI) has produced a plethora of methods (e.g., saliency-maps) to gain insight into artificial intelligence (AI) models, and has exploded with the rise of deep learning (DL). However, human-participant studies question the efficacy of these methods, particularly when the AI output is wrong. In this study, we collected and analyzed 156 human-generated text and saliency-based explanations collected in a question-answering task (N=40) and compared them empirically to state-of-the-art XAI explanations (integrated gradients, conservative LRP, and ChatGPT) in a human-participant study (N=136). Our findings show that participants found human saliency maps to be more helpful in explaining AI answers than machine saliency maps, but performance negatively correlated with trust in the AI model and explanations. This finding hints at the dilemma of AI errors in explanation, where helpful explanations can lead to lower task performance when they support wrong AI predictions.
https://arxiv.org/abs/2404.07725
In this paper, we present an improved CycleGAN-based model for underwater image enhancement. We utilize the cycle-consistent learning technique of the state-of-the-art CycleGAN model and modify its loss function with a depth-oriented attention term that enhances the contrast of the overall image while keeping global content, color, local texture, and style information intact. We train the CycleGAN model with the modified loss functions on the benchmark Enhancing Underwater Visual Perception (EUPV) dataset, a large dataset of paired and unpaired underwater images (poor and good quality) taken with seven distinct cameras in a range of visibility situations during research on ocean exploration and human-robot cooperation. In addition, we perform qualitative and quantitative evaluations that support the proposed technique and demonstrate improved contrast enhancement of underwater imagery. More significantly, the enhanced images yield better results than conventional models for downstream tasks such as underwater navigation, pose estimation, saliency prediction, object detection, and tracking. The results validate the suitability of the model for visual navigation of autonomous underwater vehicles (AUVs).
https://arxiv.org/abs/2404.07649
The analysis and prediction of visual attention have long been crucial tasks in the fields of computer vision and image processing. In practical applications, images are generally accompanied by various text descriptions; however, few studies have explored the influence of text descriptions on visual attention, let alone developed visual saliency prediction models considering text guidance. In this paper, we conduct a comprehensive study on text-guided image saliency (TIS) from both subjective and objective perspectives. Specifically, we construct a TIS database named SJTU-TIS, which includes 1200 text-image pairs and the corresponding collected eye-tracking data. Based on the established SJTU-TIS database, we analyze the influence of various text descriptions on visual attention. Then, to facilitate the development of saliency prediction models considering text influence, we construct a benchmark for the established SJTU-TIS database using state-of-the-art saliency models. Finally, since most existing saliency models ignore the effect of text descriptions on visual attention, we further propose a text-guided saliency (TGSal) prediction model, which extracts and integrates both image features and text features to predict image saliency under various text-description conditions. Our proposed model significantly outperforms the state-of-the-art saliency models on both the SJTU-TIS database and the pure image saliency databases in terms of various evaluation metrics. The SJTU-TIS database and the code of the proposed TGSal model will be released at: this https URL.
https://arxiv.org/abs/2404.07537
While reinforcement learning (RL) algorithms have been successfully applied across numerous sequential decision-making problems, their generalization to unforeseen testing environments remains a significant concern. In this paper, we study the problem of out-of-distribution (OOD) detection in RL, which focuses on identifying situations at test time that RL agents have not encountered in their training environments. We first propose a clarification of terminology for OOD detection in RL, which aligns it with the literature from other machine learning domains. We then present new benchmark scenarios for OOD detection, which introduce anomalies with temporal autocorrelation into different components of the agent-environment loop. We argue that such scenarios have been understudied in the current literature, despite their relevance to real-world situations. Confirming our theoretical predictions, our experimental results suggest that state-of-the-art OOD detectors are not able to identify such anomalies. To address this problem, we propose a novel method for OOD detection, which we call DEXTER (Detection via Extraction of Time Series Representations). By treating environment observations as time series data, DEXTER extracts salient time series features, and then leverages an ensemble of isolation forest algorithms to detect anomalies. We find that DEXTER can reliably identify anomalies across benchmark scenarios, exhibiting superior performance compared to both state-of-the-art OOD detectors and high-dimensional changepoint detectors adopted from statistics.
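A rough sketch of the DEXTER recipe as stated in the abstract, with assumed details (window length, feature set, and ensemble size are illustrative): treat recent observations as a time series, extract simple per-dimension features, and score anomalies with an ensemble of isolation forests.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def window_features(window: np.ndarray) -> np.ndarray:
    """window: (T, D) observations -> flat feature vector (mean, std, slope per dimension)."""
    t = np.arange(window.shape[0])
    slope = np.array([np.polyfit(t, window[:, d], 1)[0] for d in range(window.shape[1])])
    return np.concatenate([window.mean(0), window.std(0), slope])

class DexterLikeDetector:
    def __init__(self, n_estimators_per_forest=100, n_forests=5, seed=0):
        self.forests = [IsolationForest(n_estimators=n_estimators_per_forest,
                                        random_state=seed + i) for i in range(n_forests)]

    def fit(self, windows):                      # windows: list of (T, D) arrays
        X = np.stack([window_features(w) for w in windows])
        for forest in self.forests:
            forest.fit(X)
        return self

    def score(self, window):                     # higher score = more anomalous
        x = window_features(window)[None, :]
        return -np.mean([f.score_samples(x)[0] for f in self.forests])

rng = np.random.default_rng(0)
train = [rng.normal(size=(32, 4)) for _ in range(200)]
det = DexterLikeDetector().fit(train)
print(det.score(rng.normal(size=(32, 4))),                       # in-distribution window
      det.score(np.cumsum(rng.normal(size=(32, 4)), axis=0)))    # temporally autocorrelated drift
```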
https://arxiv.org/abs/2404.07099
Many explainable AI (XAI) techniques strive for interpretability by providing concise salient information, such as sparse linear factors. However, users see either inaccurate global explanations or highly varying local explanations. We propose to provide more detailed explanations by leveraging the human cognitive capacity to accumulate knowledge incrementally. Focusing on linear factor explanations (factors $\times$ values = outcome), we introduce Incremental XAI, which automatically partitions explanations for general and atypical instances by providing Base + Incremental factors, helping users read and remember more faithful explanations. Memorability is improved by reusing base factors and reducing the number of factors shown in atypical cases. In modeling, formative, and summative user studies, we evaluate the faithfulness, memorability, and understandability of Incremental XAI against baseline explanation methods. This work contributes towards more usable explanations that users can better internalize, facilitating intuitive engagement with AI.
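An illustrative sketch of the Base + Incremental idea (not the authors' implementation; the partition rule and sparsity settings are assumptions): fit a sparse base linear explanation on typical instances, then fit a short incremental correction on the residuals of atypical instances, so an atypical prediction reads as base factors plus a small correction.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0 * X[:, 2] * (X[:, 2] > 1.5)   # kink = atypical regime

atypical = X[:, 2] > 1.5                                   # assumed partition rule
base = Lasso(alpha=0.05).fit(X[~atypical], y[~atypical])   # sparse base factors
residual = y[atypical] - base.predict(X[atypical])
incremental = Lasso(alpha=0.05).fit(X[atypical], residual) # sparse incremental factors

def explain(x):
    """Return the (name, coefficients, intercept) terms a user would read for instance x."""
    terms = [("base", base.coef_, base.intercept_)]
    if x[2] > 1.5:                                         # same partition rule at test time
        terms.append(("incremental", incremental.coef_, incremental.intercept_))
    return terms

print(explain(np.array([0.3, -1.0, 2.0, 0.0, 0.0, 0.0])))
```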
https://arxiv.org/abs/2404.06733
As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports not only are long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Command. We find that GPT-3.5 and Command fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude has the ability to recognize important information. We also conduct a comprehensive investigation on the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4.
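Two of the analyses mentioned above can be approximated with simple text statistics; the sketch below uses assumed definitions, not the paper's exact protocol: extractiveness as the fraction of summary n-grams copied from the report, and a crude position profile of where in the source the copied n-grams originate, which is one way to probe for a position bias.

```python
from collections import Counter

def ngrams(tokens, n=4):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extractiveness(source: str, summary: str, n: int = 4) -> float:
    """Fraction of summary n-grams that appear verbatim in the source."""
    src = set(ngrams(source.split(), n))
    summ = ngrams(summary.split(), n)
    return sum(g in src for g in summ) / max(len(summ), 1)

def copied_positions(source: str, summary: str, n: int = 4, bins: int = 10):
    """Histogram over relative source positions of the copied summary n-grams."""
    src_tokens = source.split()
    first_pos = {g: i for i, g in reversed(list(enumerate(ngrams(src_tokens, n))))}
    hist = Counter()
    for g in ngrams(summary.split(), n):
        if g in first_pos:
            hist[int(bins * first_pos[g] / max(len(src_tokens) - n + 1, 1))] += 1
    return [hist[b] for b in range(bins)]

report = "revenue rose 4 % in the third quarter driven by cloud services growth overall"
summary = "revenue rose 4 % in the third quarter"
print(extractiveness(report, summary), copied_positions(report, summary, bins=5))
```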
https://arxiv.org/abs/2404.06162
While excellent in transfer learning, Vision-Language models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a first, gradient-free, pruning framework for TA-VLP where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent sophisticated, combinatorial competitors in the vast majority of the cases, paving the way towards addressing TA-VLP. The code is publicly available at this https URL.
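A schematic re-implementation of the scoring rule stated in the abstract (weight magnitude combined with the saliency of the neurons a weight connects), not the released MULTIFLOW code; the neuron-saliency proxy and the single global pruning threshold are assumptions.

```python
import torch

def neuron_saliency(weight: torch.Tensor):
    """Proxy saliency of input/output neurons from absolute weight mass."""
    return weight.abs().sum(dim=0), weight.abs().sum(dim=1)      # (in_dim,), (out_dim,)

def multiflow_like_scores(weight: torch.Tensor) -> torch.Tensor:
    """Score each weight by its magnitude times the saliency of the two neurons it connects."""
    in_sal, out_sal = neuron_saliency(weight)                    # weight shape: (out, in)
    return weight.abs() * out_sal[:, None] * in_sal[None, :]

def prune(weights: dict, ratio: float = 0.6) -> dict:
    scores = {k: multiflow_like_scores(w) for k, w in weights.items()}
    flat = torch.cat([s.flatten() for s in scores.values()])
    threshold = torch.quantile(flat, ratio)                      # one threshold across all layers
    return {k: w * (scores[k] > threshold) for k, w in weights.items()}

layers = {"vision_proj": torch.randn(256, 512), "text_proj": torch.randn(256, 384)}
pruned = prune(layers, ratio=0.6)
print({k: (v != 0).float().mean().item() for k, v in pruned.items()})   # kept fraction per layer
```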
https://arxiv.org/abs/2404.05621
In Generalized Category Discovery (GCD), we cluster unlabeled samples of known and novel classes, leveraging a training dataset of known classes. A salient challenge arises due to domain shifts between these datasets. To address this, we present a novel setting: Across Domain Generalized Category Discovery (AD-GCD) and bring forth CDAD-NET (Class Discoverer Across Domains) as a remedy. CDAD-NET is architected to synchronize potential known class samples across both the labeled (source) and unlabeled (target) datasets, while emphasizing the distinct categorization of the target data. To facilitate this, we propose an entropy-driven adversarial learning strategy that accounts for the distance distributions of target samples relative to source-domain class prototypes. Parallelly, the discriminative nature of the shared space is upheld through a fusion of three metric learning objectives. In the source domain, our focus is on refining the proximity between samples and their affiliated class prototypes, while in the target domain, we integrate a neighborhood-centric contrastive learning mechanism, enriched with an adept neighborsmining approach. To further accentuate the nuanced feature interrelation among semantically aligned images, we champion the concept of conditional image inpainting, underscoring the premise that semantically analogous images prove more efficacious to the task than their disjointed counterparts. Experimentally, CDAD-NET eclipses existing literature with a performance increment of 8-15% on three AD-GCD benchmarks we present.
https://arxiv.org/abs/2404.05366
Recent progress has shown great potential of visual prompt tuning (VPT) when adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions independently optimize prompts at each layer, thereby neglecting the usage of task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can harm the sharing of task-relevant information. In this paper, we propose a novel VPT approach, iVPT. It innovatively incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, iVPT introduces an attentive reinforcement (AR) mechanism, by automatically identifying salient image tokens, which are further enhanced by prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantage of the proposed iVPT, compared to the state-of-the-art counterparts.
https://arxiv.org/abs/2404.05207
Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks, this paper presents a novel hybrid visual odometry (VO) framework that leverages pose-only supervision, offering a balanced solution between robustness and the need for extensive labeling. We propose two cost-effective and innovative designs: a self-supervised homographic pre-training for enhancing optical flow learning from pose-only labels and a random patch-based salient point detection strategy for more accurate optical flow patch extraction. These designs eliminate the need for dense optical flow labels for training and significantly improve the generalization capability of the system in diverse and challenging environments. Our pose-only supervised method achieves competitive performance on standard datasets and greater robustness and generalization ability in extreme and unseen scenarios, even compared to dense optical flow-supervised state-of-the-art methods.
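An illustrative sketch of random patch-based salient point selection; the abstract does not give the exact criterion, so here "salient" is assumed to mean the highest gradient magnitude inside each randomly placed patch.

```python
import numpy as np

def salient_points(gray: np.ndarray, n_patches: int = 32, patch: int = 16, seed: int = 0):
    """Sample random patches and keep the strongest-gradient pixel in each as a candidate point."""
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    rng = np.random.default_rng(seed)
    H, W = gray.shape
    points = []
    for _ in range(n_patches):
        y = rng.integers(0, H - patch)
        x = rng.integers(0, W - patch)
        window = magnitude[y:y + patch, x:x + patch]
        dy, dx = np.unravel_index(np.argmax(window), window.shape)
        points.append((y + dy, x + dx))
    return points

image = np.random.default_rng(1).random((240, 320))   # stand-in for a grayscale frame
print(salient_points(image)[:5])
```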
https://arxiv.org/abs/2404.04677
Gradient-based saliency maps have been widely used to explain the decisions of deep neural network classifiers. However, standard gradient-based interpretation maps, including the simple gradient and integrated gradient algorithms, often lack desired structures such as sparsity and connectedness in their application to real-world computer vision models. A frequently used approach to inducing sparsity structures into gradient-based saliency maps is to alter the simple gradient scheme using sparsification or norm-based regularization. A drawback with such post-processing methods is their frequently-observed significant loss in fidelity to the original simple gradient map. In this work, we propose to apply adversarial training as an in-processing scheme to train neural networks with structured simple gradient maps. We show a duality relation between the regularized norms of the adversarial perturbations and gradient-based maps, based on which we design adversarial training loss functions promoting sparsity and group-sparsity properties in simple gradient maps. We present several numerical results to show the influence of our proposed norm-based adversarial training methods on the standard gradient-based maps of standard neural network architectures on benchmark image datasets.
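A conceptual sketch of the proposed in-processing scheme under assumed hyperparameters (not the paper's exact formulation): the inner maximization carries an L1 penalty on the perturbation, and the proximal soft-thresholding step keeps the perturbation sparse, mirroring the sparsity sought in the simple gradient map; training then minimizes the loss on the perturbed input.

```python
import torch
import torch.nn.functional as F

def l1_adversarial_example(model, x, y, steps=5, lr=0.1, l1=0.05):
    """Proximal gradient ascent on the loss with an L1 penalty on the perturbation."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad                               # ascend the classification loss
            shrink = (delta.abs() - lr * l1).clamp_min(0)
            delta.copy_(delta.sign() * shrink)               # soft-threshold -> sparse perturbation
    return (x + delta).detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))

x_adv = l1_adversarial_example(model, x, y)                  # inner maximization
optimizer.zero_grad()
F.cross_entropy(model(x_adv), y).backward()                  # outer minimization step
optimizer.step()
```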
https://arxiv.org/abs/2404.04647
There currently exist two extreme viewpoints for neural network feature learning -- (i) Neural networks simply implement a kernel method (a la NTK) and hence no features are learned (ii) Neural networks can represent (and hence learn) intricate hierarchical features suitable for the data. We argue in this paper neither interpretation is likely to be correct based on a novel viewpoint. Neural networks can be viewed as a mixture of experts, where each expert corresponds to a (number of layers length) path through a sequence of hidden units. We use this alternate interpretation to motivate a model, called the Deep Linearly Gated Network (DLGN), which sits midway between deep linear networks and ReLU networks. Unlike deep linear networks, the DLGN is capable of learning non-linear features (which are then linearly combined), and unlike ReLU networks these features are ultimately simple -- each feature is effectively an indicator function for a region compactly described as an intersection of (number of layers) half-spaces in the input space. This viewpoint allows for a comprehensive global visualization of features, unlike the local visualizations for neurons based on saliency/activation/gradient maps. Feature learning in DLGNs is shown to happen and the mechanism with which this happens is through learning half-spaces in the input space that contain smooth regions of the target function. Due to the structure of DLGNs, the neurons in later layers are fundamentally the same as those in earlier layers -- they all represent a half-space -- however, the dynamics of gradient descent impart a distinct clustering to the later layer neurons. We hypothesize that ReLU networks also have similar feature learning behaviour.
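A minimal sketch consistent with the description above (the widths, depth, and soft sigmoid gates are illustrative choices; hard 0/1 gates are the limiting case): a purely linear gating pathway decides half-space membership, and it multiplicatively gates a purely linear value pathway, so each learned feature corresponds to an intersection of half-spaces in the input space.

```python
import torch
import torch.nn as nn

class DLGN(nn.Module):
    def __init__(self, in_dim=16, width=64, depth=3, out_dim=2, beta=10.0):
        super().__init__()
        self.gating = nn.ModuleList(
            nn.Linear(in_dim if i == 0 else width, width) for i in range(depth))
        self.value = nn.ModuleList(
            nn.Linear(in_dim if i == 0 else width, width) for i in range(depth))
        self.head = nn.Linear(width, out_dim)
        self.beta = beta        # soft-gate temperature; beta -> infinity gives hard 0/1 gates

    def forward(self, x):
        g, v = x, x
        for gate_layer, value_layer in zip(self.gating, self.value):
            g = gate_layer(g)                                   # purely linear gating pathway
            v = value_layer(v) * torch.sigmoid(self.beta * g)   # value pathway gated by half-spaces
        return self.head(v)

model = DLGN()
print(model(torch.randn(4, 16)).shape)   # torch.Size([4, 2])
```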
https://arxiv.org/abs/2404.04312