Modern face recognition systems utilize deep neural networks to extract salient features from a face. These features are embeddings in a latent space and are often stored as templates in a face recognition system. These embeddings are susceptible to data leakage and, in some cases, can even be used to reconstruct the original face image. To prevent compromising identities, template protection schemes are commonly employed. However, these schemes may still not prevent the leakage of soft biometric information such as age, gender, and race. To alleviate this issue, we propose a novel technique that combines Fully Homomorphic Encryption (FHE) with an existing template protection scheme known as PolyProtect. We show that the embeddings can be compressed and encrypted using FHE and transformed into a secure PolyProtect template using a polynomial transformation for additional protection. We demonstrate the efficacy of the proposed approach through extensive experiments on multiple datasets. Our proposed approach ensures irreversibility and unlinkability, effectively preventing the leakage of soft biometric attributes from face embeddings without compromising recognition accuracy.
https://arxiv.org/abs/2404.16255
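The abstract does not spell out the PolyProtect mapping, but the general idea is to map overlapping windows of the embedding through a user-specific polynomial. Below is a minimal, illustrative Python sketch of that step only (the FHE compression/encryption layer is omitted); the function name polyprotect_like, the window size, and the coefficient/exponent choices are assumptions, not the authors' implementation.

```python
import numpy as np

def polyprotect_like(embedding, coeffs, exponents, overlap=2):
    """Map overlapping windows of a face embedding to single values via a
    user-specific polynomial (illustrative sketch of the PolyProtect idea)."""
    m = len(coeffs)                       # window size
    step = m - overlap                    # stride between consecutive windows
    protected = []
    for start in range(0, len(embedding) - m + 1, step):
        window = embedding[start:start + m]
        protected.append(np.sum(coeffs * window ** exponents))  # sum_i c_i * v_i^e_i
    return np.array(protected)

rng = np.random.default_rng(0)
v = rng.normal(size=512)                  # face embedding from a deep network
C = rng.choice([c for c in range(-50, 51) if c != 0], size=5)  # secret coefficients
E = np.array([1, 2, 3, 4, 5])             # secret exponents
print(polyprotect_like(v, C, E).shape)    # a shorter, protected template
```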
Infrared and visible image fusion (IVIF) aims to preserve thermal radiation information from infrared images while integrating texture details from visible images, enabling the capture of important features and hidden details of subjects in complex scenes and disturbed environments. Consequently, IVIF offers distinct advantages in practical applications such as video surveillance, night navigation, and target recognition. However, prevailing methods often face challenges in simultaneously capturing thermal region features and detailed information due to the disparate characteristics of infrared and visible images. As a result, fusion outcomes frequently entail a compromise between thermal target area information and texture details. In this study, we introduce a novel heterogeneous dual-discriminator generative adversarial network (HDDGAN) to address this issue. Specifically, the generator adopts a multi-scale skip-connected structure, facilitating the extraction of essential features from different source images. To enhance the information representation ability of the fusion result, an attention mechanism is employed to construct the information fusion layer within the generator, leveraging the disparities between the source images. Moreover, recognizing the distinct learning requirements of information in infrared and visible images, we design two discriminators with differing structures. This approach aims to guide the model to learn salient information from infrared images while simultaneously capturing detailed information from visible images. Extensive experiments conducted on various public datasets demonstrate the superiority of our proposed HDDGAN over other state-of-the-art (SOTA) algorithms, highlighting its enhanced potential for practical applications.
https://arxiv.org/abs/2404.15992
Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) a verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) a grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) a noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.
https://arxiv.org/abs/2404.15785
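As a rough illustration of how a generated description ensemble can replace a single class-name prompt in zero-shot scoring, here is a small Python sketch. The embeddings are random placeholders standing in for CLIP-style image/text features, and verb_score is a hypothetical helper, not LEX's actual pipeline.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def verb_score(image_emb, description_embs):
    """Score a verb class by averaging similarity over several generated
    verb-centric descriptions instead of a single class-name prompt."""
    return float(np.mean([cosine(image_emb, d) for d in description_embs]))

# Placeholder embeddings; in practice these would come from a VLM's image/text
# encoders and an LLM that writes the verb-centric descriptions.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
verb_classes = {
    "jumping": [rng.normal(size=512) for _ in range(3)],
    "carrying": [rng.normal(size=512) for _ in range(3)],
}
scores = {v: verb_score(image_emb, d) for v, d in verb_classes.items()}
print(max(scores, key=scores.get))
```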
This paper handles the problem of converting real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Although this problem could be addressed by a wide range of image-to-image translation models, a notable issue with all these methods is that the original image content details can easily be erased or corrupted by the transfer of ink-wash style elements. To alleviate this issue, we propose to incorporate saliency detection into the unpaired image-to-image translation framework to regularize content information of the generated paintings. The saliency map is utilized for content regularization from two aspects, both explicitly and implicitly: (i) we propose a saliency IOU (SIOU) loss to explicitly regularize saliency consistency before and after stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances content integrity of the generated paintings by injecting saliency information into the generator network to guide painting generation. Besides, we also propose a saliency-attended discriminator network that harnesses the saliency mask to focus generative adversarial attention onto salient image regions, which contributes to producing a finer ink-wash stylization effect for the salient objects of images. Qualitative and quantitative experiments consistently demonstrate the superiority of our model over related advanced methods for Chinese ink-wash painting style transfer.
https://arxiv.org/abs/2404.15743
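The abstract does not give the exact form of the SIOU loss; one plausible reading is a soft IoU between the saliency maps of the input photo and of the stylized output, sketched below in PyTorch. The name saliency_iou_loss and the soft-IoU formulation are assumptions.

```python
import torch

def saliency_iou_loss(sal_content, sal_stylized, eps=1e-6):
    """Soft-IoU penalty encouraging the saliency map of the stylized painting
    to overlap the saliency map of the input photo (maps in [0, 1])."""
    inter = (sal_content * sal_stylized).sum(dim=(1, 2, 3))
    union = (sal_content + sal_stylized - sal_content * sal_stylized).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

sal_photo = torch.rand(4, 1, 256, 256)     # saliency of the input photo
sal_paint = torch.rand(4, 1, 256, 256)     # saliency of the generated painting
print(saliency_iou_loss(sal_photo, sal_paint))
```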
This paper proposes a new gradient-based XAI method called Guided AbsoluteGrad for saliency map explanations. We utilize both positive and negative gradient magnitudes and employ gradient variance to distinguish the important areas for noise reduction. We also introduce a novel evaluation metric named ReCover And Predict (RCAP), which considers the Localization and Visual Noise Level objectives of the explanations. We propose two propositions for these two objectives and prove the necessity of evaluating them. We evaluate Guided AbsoluteGrad against seven gradient-based XAI methods using the RCAP metric and other SOTA metrics in three case studies: (1) the ImageNet dataset with a ResNet50 model; (2) the International Skin Imaging Collaboration (ISIC) dataset with an EfficientNet model; (3) the Places365 dataset with a DenseNet161 model. Our method surpasses other gradient-based approaches, showcasing the quality of saliency map explanations enhanced through gradient magnitude.
https://arxiv.org/abs/2404.15564
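The abstract does not give the precise aggregation rule; a plausible sketch of the idea (averaging absolute input gradients over noisy samples and gating them with a variance-based mask) is shown below in PyTorch. The function name, the SmoothGrad-style noise sampling, and the quantile threshold q are assumptions.

```python
import torch

def guided_absolute_grad_like(model, x, target, n_samples=8, sigma=0.1, q=0.5):
    """Average absolute input gradients over noisy copies of the image and keep
    only regions whose gradient variance is high (illustrative sketch)."""
    grads = []
    for _ in range(n_samples):
        noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[:, target].sum()          # class score to explain
        grads.append(torch.autograd.grad(score, noisy)[0])
    grads = torch.stack(grads)                         # (n_samples, B, C, H, W)
    magnitude = grads.abs().mean(dim=0)                # absolute gradient magnitude
    variance = grads.var(dim=0)
    thresh = variance.flatten(1).quantile(q, dim=1).view(-1, 1, 1, 1)
    return (magnitude * (variance >= thresh).float()).sum(dim=1)   # (B, H, W) saliency
```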
This paper studies the interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from the fully connected layers and the gradient through variants of backpropagation. However, it is well understood that gradients are noisy, and alternatives like guided backpropagation have been proposed to obtain better visualizations at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and quantitatively improves the interpretability properties of different networks, as measured by several interpretability methods.
https://arxiv.org/abs/2404.15024
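A minimal PyTorch sketch of one way to express such a penalty is shown below: the standard input gradient is kept in the autograd graph so it can be trained, while a guided-backprop gradient (obtained by clamping ReLU gradients with backward hooks) serves as a detached target. The hook trick and the squared-error form of the penalty are assumptions; the authors' exact loss may differ.

```python
import torch
import torch.nn as nn

def guided_relu_hook(module, grad_in, grad_out):
    # Guided backprop: additionally zero out negative gradients flowing back through ReLU.
    return (grad_in[0].clamp(min=0),)

def gradient_alignment_loss(model, x, y):
    """Penalize the gap between the standard input gradient (differentiable, so it
    can be trained) and a detached guided-backprop gradient used as the target.
    Assumes the model uses non-in-place nn.ReLU activations."""
    x = x.requires_grad_(True)
    score = model(x).gather(1, y[:, None]).sum()
    g_std = torch.autograd.grad(score, x, create_graph=True)[0]

    handles = [m.register_full_backward_hook(guided_relu_hook)
               for m in model.modules() if isinstance(m, nn.ReLU)]
    score_guided = model(x).gather(1, y[:, None]).sum()
    g_guided = torch.autograd.grad(score_guided, x)[0].detach()
    for h in handles:
        h.remove()

    return (g_std - g_guided).pow(2).mean()
```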
Salient object detection (SOD) aims at finding the most salient objects in images and outputting pixel-level binary masks. Transformer-based methods achieve promising performance due to their global semantic understanding, which is crucial for identifying salient objects. However, these models tend to be large and require numerous training parameters. To better harness the potential of transformers for SOD, we propose a novel parameter-efficient fine-tuning method aimed at reducing the number of training parameters while enhancing salient object detection capability. Our model, termed EXternal Prompt features Enhanced adapteR Tuning (ExPert), features an encoder-decoder structure with adapters and injectors interspersed between the layers of a frozen transformer encoder. The adapter modules adapt the pre-trained backbone to SOD, while the injector modules incorporate external prompt features to enhance awareness of salient objects. Comprehensive experiments demonstrate the superiority of our method. Surpassing former state-of-the-art (SOTA) models across five SOD datasets, ExPert achieves 0.215 mean absolute error (MAE) on the ECSSD dataset with 80.2M trained parameters, 21% better than the transformer-based SOTA model and 47% better than the CNN-based SOTA model.
https://arxiv.org/abs/2404.15008
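To make the adapter/injector vocabulary concrete, here is a generic PyTorch sketch of a bottleneck adapter and a prompt-feature injector placed between frozen encoder layers; the module names, bottleneck width, and the simple additive injection are assumptions rather than ExPert's actual design.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable residual bottleneck placed between frozen encoder layers."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):                       # x: (B, N, dim) token features
        return x + self.up(self.act(self.down(x)))

class Injector(nn.Module):
    """Adds projected external prompt features to the token stream."""
    def __init__(self, prompt_dim, dim):
        super().__init__()
        self.proj = nn.Linear(prompt_dim, dim)

    def forward(self, x, prompt):               # prompt: (B, N, prompt_dim)
        return x + self.proj(prompt)

tokens = torch.randn(2, 196, 768)               # frozen ViT layer output
prompt = torch.randn(2, 196, 256)               # external prompt features
out = Injector(256, 768)(Adapter(768)(tokens), prompt)
print(out.shape)                                # torch.Size([2, 196, 768])
```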
Explanations obtained from transformer-based architectures, in the form of raw attention, can be seen as a class-agnostic saliency map. Additionally, attention-based pooling serves as a form of masking in feature space. Motivated by this observation, we design an attention-based pooling mechanism intended to replace Global Average Pooling (GAP) at inference. This mechanism, called Cross-Attention Stream (CA-Stream), comprises a stream of cross-attention blocks interacting with features at different network depths. CA-Stream enhances interpretability in models while preserving recognition performance.
https://arxiv.org/abs/2404.14996
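A compact PyTorch sketch of the general idea is shown below: a learned query cross-attends over the spatial feature map instead of global average pooling, and the attention weights double as a saliency map. The class name and the single-block form are simplifications of the multi-depth stream described above.

```python
import torch
import torch.nn as nn

class CrossAttentionPool(nn.Module):
    """Replace global average pooling with cross-attention: one learned query
    attends over the spatial feature map (sketch of the CA-Stream idea)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                            # feats: (B, C, H, W)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)         # (B, H*W, C)
        q = self.query.expand(b, -1, -1)
        pooled, attn = self.attn(q, tokens, tokens)       # attention acts as a saliency map
        return pooled.squeeze(1), attn.reshape(b, h, w)

x = torch.randn(2, 512, 7, 7)
vec, sal = CrossAttentionPool(512)(x)
print(vec.shape, sal.shape)                               # (2, 512) and (2, 7, 7)
```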
Recently, unsupervised salient object detection (USOD) has gained increasing attention due to its annotation-free nature. However, current methods mainly focus on specific tasks such as RGB and RGB-D, neglecting the potential for task migration. In this paper, we propose a unified USOD framework for generic USOD tasks. First, we propose a Progressive Curriculum Learning-based Saliency Distilling (PCL-SD) mechanism to extract saliency cues from a pre-trained deep network. This mechanism starts with easy samples and progressively moves towards harder ones, to avoid the initial interference caused by hard samples. Afterwards, the obtained saliency cues are utilized to train a saliency detector, and we employ a Self-rectify Pseudo-label Refinement (SPR) mechanism to improve the quality of pseudo-labels. Finally, an adapter-tuning method is devised to transfer the acquired saliency knowledge, leveraging shared knowledge to attain superior transfer performance on the target tasks. Extensive experiments on five representative SOD tasks confirm the effectiveness and feasibility of our proposed method. Code and supplementary materials are available at this https URL.
https://arxiv.org/abs/2404.14759
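The easy-to-hard scheduling can be pictured with a small sketch like the one below, which grows the pool of pseudo-labeled samples round by round according to an "easiness" score; the proxy score and the linear schedule are assumptions, not the paper's exact PCL-SD mechanism.

```python
import numpy as np

def curriculum_schedule(easiness, n_rounds=4):
    """Progressively enlarge the training pool from easy to hard samples,
    in the spirit of curriculum-style saliency distilling (illustrative sketch)."""
    order = np.argsort(-easiness)                 # easiest samples first
    pools = []
    for r in range(1, n_rounds + 1):
        k = int(len(order) * r / n_rounds)        # use 25%, 50%, 75%, 100% of the data
        pools.append(order[:k])
    return pools

# Hypothetical proxy: easiness = mean foreground confidence of the saliency cue
# extracted from the pre-trained network.
easiness = np.random.default_rng(0).random(1000)
for i, pool in enumerate(curriculum_schedule(easiness)):
    print(f"round {i}: train on {len(pool)} samples")
```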
Visual highlighting can guide user attention in complex interfaces. However, its effectiveness under limited attentional capacities is underexplored. This paper examines the joint impact of visual highlighting (permanent and dynamic) and dual-task-induced cognitive load on gaze behaviour. Our analysis, using eye-movement data from 27 participants viewing 150 unique webpages reveals that while participants' ability to attend to UI elements decreases with increasing cognitive load, dynamic adaptations (i.e., highlighting) remain attention-grabbing. The presence of these factors significantly alters what people attend to and thus what is salient. Accordingly, we show that state-of-the-art saliency models increase their performance when accounting for different cognitive loads. Our empirical insights, along with our openly available dataset, enhance our understanding of attentional processes in UIs under varying cognitive (and perceptual) loads and open the door for new models that can predict user attention while multitasking.
https://arxiv.org/abs/2404.14232
Graph neural networks (GNNs) have revolutionized the field of machine learning on non-Euclidean data such as graphs and networks. GNNs effectively implement node representation learning through neighborhood aggregation and achieve impressive results in many graph-related tasks. However, most neighborhood aggregation approaches are summation-based, which can be problematic as they may not be sufficiently expressive to encode informative graph structures. Furthermore, though the graph pooling module is also of vital importance for graph learning, especially for the task of graph classification, research on graph down-sampling mechanisms is rather limited. To address the above challenges, we propose a concatenation-based graph convolution mechanism that injectively updates node representations to maximize the discriminative power in distinguishing non-isomorphic subgraphs. In addition, we design a novel graph pooling module, called WL-SortPool, to learn important subgraph patterns in a deep-learning manner. WL-SortPool layer-wise sorts node representations (i.e. continuous WL colors) to separately learn the relative importance of subtrees with different depths for the purpose of classification, thus better characterizing the complex graph topology and rich information encoded in the graph. We propose a novel Subgraph Pattern GNN (SPGNN) architecture that incorporates these enhancements. We test the proposed SPGNN architecture on many graph classification benchmarks. Experimental results show that our method can achieve highly competitive results with state-of-the-art graph kernels and other GNN approaches.
https://arxiv.org/abs/2404.13655
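As a toy illustration of a concatenation-based (rather than summation-based) update, the sketch below concatenates a node's own feature with its aggregated neighborhood before an MLP; it uses a dense adjacency matrix for brevity and is not the SPGNN layer itself.

```python
import torch
import torch.nn as nn

class ConcatGraphConv(nn.Module):
    """Concatenate a node's own feature with its aggregated neighborhood before
    the MLP, instead of summing them together (sketch of the general idea)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x, adj):                  # x: (N, in_dim), adj: (N, N) binary
        neigh = adj @ x                         # neighborhood aggregation
        return self.mlp(torch.cat([x, neigh], dim=-1))

x = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
print(ConcatGraphConv(16, 32)(x, adj).shape)    # torch.Size([5, 32])
```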
To address the challenges of providing quick and plausible explanations in Explainable AI (XAI) for object detection models, we introduce the Gaussian Class Activation Mapping Explainer (G-CAME). Our method efficiently generates concise saliency maps by utilizing activation maps from selected layers and applying a Gaussian kernel to emphasize critical image regions for the predicted object. Compared with other Region-based approaches, G-CAME significantly reduces explanation time to 0.5 seconds without compromising the quality. Our evaluation of G-CAME, using Faster-RCNN and YOLOX on the MS-COCO 2017 dataset, demonstrates its ability to offer highly plausible and faithful explanations, especially in reducing the bias on tiny object detection.
https://arxiv.org/abs/2404.13417
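A minimal NumPy sketch of the core step is given below: a CAM-style activation map is reweighted with a Gaussian centered on the predicted box so that the explanation stays on the detected object. The sigma scaling and the function name are assumptions, not G-CAME's exact formulation.

```python
import numpy as np

def gaussian_weighted_cam(cam, box, sigma_scale=0.5):
    """Emphasize the region around a predicted box by multiplying a CAM-style
    map with a Gaussian centered on the box (sketch of the G-CAME idea)."""
    h, w = cam.shape
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx, sy = sigma_scale * (x2 - x1), sigma_scale * (y2 - y1)
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2)
    return cam * g

cam = np.random.rand(64, 64)                    # activation map for one detection
print(gaussian_weighted_cam(cam, box=(10, 12, 30, 40)).shape)
```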
Interpreting and understanding the predictions made by deep learning models poses a formidable challenge due to their inherently opaque nature. Many previous efforts aimed at explaining these predictions rely on input features, specifically, the words within NLP models. However, such explanations are often less informative due to the discrete nature of these words and their lack of contextual verbosity. To address this limitation, we introduce the Latent Concept Attribution method (LACOAT), which generates explanations for predictions based on latent concepts. Our founding intuition is that a word can exhibit multiple facets, contingent upon the context in which it is used. Therefore, given a word in context, the latent space derived from our training process reflects a specific facet of that word. LACOAT functions by mapping the representations of salient input words into the training latent space, allowing it to provide predictions with context-based explanations within this latent space.
https://arxiv.org/abs/2404.12545
The proliferation of applications using artificial intelligence (AI) systems has led to a growing number of users interacting with these systems through sophisticated interfaces. Human-computer interaction research has long shown that interfaces shape both user behavior and user perception of technical capabilities and risks. Yet, practitioners and researchers evaluating the social and ethical risks of AI systems tend to overlook the impact of anthropomorphic, deceptive, and immersive interfaces on human-AI interactions. Here, we argue that design features of interfaces with adaptive AI systems can have cascading impacts, driven by feedback loops, which extend beyond those previously considered. We first conduct a scoping review of AI interface designs and their negative impact to extract salient themes of potentially harmful design patterns in AI interfaces. Then, we propose Design-Enhanced Control of AI systems (DECAI), a conceptual model to structure and facilitate impact assessments of AI interface designs. DECAI draws on principles from control systems theory -- a theory for the analysis and design of dynamic physical systems -- to dissect the role of the interface in human-AI systems. Through two case studies on recommendation systems and conversational language model systems, we show how DECAI can be used to evaluate AI interface designs.
https://arxiv.org/abs/2404.11370
Inquisitive questions -- open-ended, curiosity-driven questions people ask as they read -- are an integral part of discourse processing (Kehler and Rohde, 2017; Onea, 2016) and comprehension (Prince, 2004). Recent work in NLP has taken advantage of question generation capabilities of LLMs to enhance a wide range of applications. But the space of inquisitive questions is vast: many questions can be evoked from a given context. So which of those should be prioritized to find answers? Linguistic theories, unfortunately, have not yet provided an answer to this question. This paper presents QSALIENCE, a salience predictor of inquisitive questions. QSALIENCE is instruction-tuned over our dataset of linguist-annotated salience scores of 1,766 (context, question) pairs. A question scores high on salience if answering it would greatly enhance the understanding of the text (Van Rooy, 2003). We show that highly salient questions are empirically more likely to be answered in the same article, bridging potential questions (Onea, 2016) with Questions Under Discussion (Roberts, 2012). We further validate our findings by showing that answering salient questions is an indicator of summarization quality in news.
https://arxiv.org/abs/2404.10917
The aim of this work is to establish how accurately a recent semantic-based foveal active perception model is able to complete visual tasks that are regularly performed by humans, namely scene exploration and visual search. This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations. It has previously been used in scene exploration tasks. In this paper, we revisit the model and extend its application to visual search tasks. To illustrate the benefits of using semantic information in scene exploration and visual search tasks, we compare its performance against traditional saliency-based models. In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model in accurately representing the semantic information present in the visual scene. In visual search experiments, it outperforms the saliency-driven model and a random gaze-selection algorithm when searching for instances of a target class in a visual field containing multiple distractors. Our results demonstrate that top-down semantic information significantly influences visual exploration and search tasks, suggesting a potential area of research for integrating it with traditional bottom-up cues.
https://arxiv.org/abs/2404.10836
Generating background scenes for salient objects plays a crucial role across various domains including creative design and e-commerce, as it enhances the presentation and context of subjects by integrating them into tailored environments. Background generation can be framed as a task of text-conditioned outpainting, where the goal is to extend image content beyond a salient object's boundaries on a blank background. Although popular diffusion models for text-guided inpainting can also be used for outpainting by mask inversion, they are trained to fill in missing parts of an image rather than to place an object into a scene. Consequently, when used for background creation, inpainting models frequently extend the salient object's boundaries and thereby change the object's identity, which is a phenomenon we call "object expansion." This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task using Stable Diffusion and ControlNet architectures. We present a series of qualitative and quantitative results across models and datasets, including a newly proposed metric to measure object expansion that does not require any human labeling. Compared to Stable Diffusion 2.0 Inpainting, our proposed approach reduces object expansion by 3.6x on average with no degradation in standard visual metrics across multiple datasets.
https://arxiv.org/abs/2404.10157
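A label-free proxy for object expansion can be sketched as the relative growth of the salient-object mask re-estimated after outpainting, as below; this is only an illustrative stand-in, since the paper's actual metric is not specified in the abstract.

```python
import numpy as np

def object_expansion(mask_before, mask_after):
    """Relative growth of the salient-object mask after background generation:
    0 means no expansion (illustrative proxy, not the paper's exact metric)."""
    before = mask_before.astype(bool)
    after = mask_after.astype(bool)
    grown = np.logical_and(after, ~before).sum()
    return grown / max(before.sum(), 1)

m0 = np.zeros((128, 128), bool); m0[40:80, 40:80] = True    # original object mask
m1 = np.zeros((128, 128), bool); m1[38:84, 40:84] = True    # mask re-estimated after outpainting
print(f"expansion: {object_expansion(m0, m1):.3f}")
```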
Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains like animated wallpapers require seamless looping, where the first and last frames of the video match seamlessly. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require the entire video as input during training to encode temporal and positional information at once. However, due to limitations in GPU memory, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy that progressively increases the number of frames and reduces the fine-tuning modules. Additionally, we introduce the Temporal Enhanced Motion Module (TEMM) to extend the capacity for encoding temporal and positional information up to 36 frames. The proposed LoopAnimate thus, for the first time, extends the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
https://arxiv.org/abs/2404.09172
Learning the skill of human bimanual grasping can extend the capabilities of robotic systems when grasping large or heavy objects. However, it requires a much larger search space for grasp points than single-hand grasping, as well as numerous bimanual grasping annotations for network learning, making both data-driven and analytical grasping methods inefficient and insufficient. We propose a framework for bimanual grasp saliency learning that aims to predict the contact points for bimanual grasping based on existing human single-handed grasping data. We learn saliency corresponding vectors through minimal bimanual contact annotations that establish correspondences between the grasp positions of both hands, eliminating the need to build a large-scale bimanual grasp dataset. The existing single-handed grasp saliency value serves as the initial value for bimanual grasp saliency, and we learn a saliency adjusted score that is added to this initial value to obtain the final bimanual grasp saliency value, making it possible to predict preferred bimanual grasp positions from single-handed grasp saliency. We also introduce a physics-balance loss function and a physics-aware refinement module that enable physical grasp balance, enhancing generalization to unknown objects. Comprehensive experiments in simulation and comparisons on dexterous grippers demonstrate that our method can achieve balanced bimanual grasping effectively.
https://arxiv.org/abs/2404.08944
In this paper, we introduce Saliency-Based Adaptive Masking (SBAM), a novel and cost-effective approach that significantly enhances the pre-training performance of Masked Image Modeling (MIM) approaches by prioritizing token salience. Our method provides robustness against variations in masking ratios, effectively mitigating the performance instability issues common in existing methods. This relaxes the sensitivity of MIM-based pre-training to masking ratios, which in turn allows us to propose an adaptive strategy for "tailored" masking ratios for each data sample, which no existing method can provide. Toward this goal, we propose an Adaptive Masking Ratio (AMR) strategy that dynamically adjusts the proportion of masking for the unique content of each image based on token salience. We show that our method significantly improves over the state-of-the-art in mask-based pre-training on the ImageNet-1K dataset.
https://arxiv.org/abs/2404.08327
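One plausible instantiation of salience-driven adaptive masking is sketched below: a per-image masking ratio derived from the salience distribution, followed by salience-weighted token sampling. The specific ratio rule and the sampling scheme are assumptions, not SBAM/AMR's exact procedure.

```python
import numpy as np

def adaptive_mask(token_salience, base_ratio=0.75, spread=0.15):
    """Pick a per-image masking ratio from the salience distribution and mask
    tokens by salience-weighted sampling (one plausible instantiation)."""
    n = len(token_salience)
    concentration = token_salience.max() / (token_salience.mean() + 1e-8)
    ratio = np.clip(base_ratio + spread * np.tanh(concentration - 1.0), 0.4, 0.9)
    probs = token_salience / token_salience.sum()
    masked = np.random.choice(n, size=int(ratio * n), replace=False, p=probs)
    return masked, ratio

salience = np.random.default_rng(0).random(196)   # e.g., per-token attention-based salience
idx, r = adaptive_mask(salience)
print(f"masking {len(idx)} of 196 tokens (ratio {r:.2f})")
```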