Deep learning models often encounter challenges in making accurate inferences when there are domain shifts between the source and target data. This issue is particularly pronounced in clinical settings due to the scarcity of annotated data resulting from the professional and private nature of medical data. Although decent solutions exist, many of them are hindered in clinical settings by limitations in data collection and computational complexity. To tackle domain shifts in data-scarce medical scenarios, we propose a Random frequency filtering enabled Single-source Domain Generalization algorithm (RaffeSDG), which promises robust out-of-domain inference with segmentation models trained on a single-source domain. A filter-based data augmentation strategy is first proposed to promote domain variability within a single-source domain by introducing variations in frequency space and blending homologous samples. Gaussian filter-based structural saliency is then leveraged to learn robust representations across augmented samples, further facilitating the training of generalizable segmentation models. To validate the effectiveness of RaffeSDG, we conducted extensive experiments involving out-of-domain inference on segmentation tasks for three human tissues imaged by four diverse modalities. Through thorough investigations and comparisons, these experiments provide compelling evidence of the potential and generalizability of RaffeSDG. The code is available at this https URL.
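For a concrete picture of the filter-based augmentation idea, the minimal sketch below perturbs the amplitude spectrum of an image with a random smooth radial filter and blends the result back with the original (homologous) sample. The function name, filter shape, and blending rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def random_frequency_augment(img, blend_alpha=None, rng=None):
    """Perturb the amplitude spectrum of `img` (float array in [0, 1], shape
    (H, W) or (H, W, C)) with a random smooth radial filter, then blend the
    filtered image with the original sample."""
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.fft2(img, axes=(0, 1))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)

    # Random gain over the frequency grid (stronger effect at low frequencies).
    h, w = img.shape[:2]
    radius = np.sqrt(np.fft.fftfreq(h)[:, None] ** 2 + np.fft.fftfreq(w)[None, :] ** 2)
    gain = 1.0 + rng.uniform(-0.5, 0.5) * np.exp(-radius / rng.uniform(0.05, 0.3))
    if img.ndim == 3:
        gain = gain[..., None]

    filtered = np.real(np.fft.ifft2(amplitude * gain * np.exp(1j * phase), axes=(0, 1)))

    # Blend the homologous pair to widen appearance variability within one domain.
    alpha = rng.uniform(0.0, 1.0) if blend_alpha is None else blend_alpha
    return np.clip(alpha * img + (1.0 - alpha) * filtered, 0.0, 1.0)
```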
https://arxiv.org/abs/2405.01228
Different from context-independent (CI) concepts such as human, car, and airplane, context-dependent (CD) concepts, such as camouflaged objects and medical lesions, require higher visual understanding ability. Despite the rapid advance of many CD understanding tasks in their respective branches, this isolated evolution leads to limited cross-domain generalisation and repetitive technique innovation. Since there is a strong coupling relationship between foreground and background context in CD tasks, existing methods must train separate models for their focused domains. This restricts their real-world CD concept understanding on the path towards artificial general intelligence (AGI). We propose a unified model with a single set of parameters, Spider, which only needs to be trained once. With the help of the proposed concept filter driven by the image-mask group prompt, Spider is able to understand and distinguish diverse, strongly context-dependent concepts and accurately capture the Prompter's intention. Without bells and whistles, Spider significantly outperforms state-of-the-art specialized models in 8 different context-dependent segmentation tasks, including 4 natural scenes (salient, camouflaged, and transparent objects and shadow) and 4 medical lesions (COVID-19, polyp, breast, and skin lesion with color colonoscopy, CT, ultrasound, and dermoscopy modalities). Besides, Spider shows obvious advantages in continual learning. It can complete training on new tasks by fine-tuning less than 1% of its parameters, while incurring a tolerable performance degradation of less than 5% on all old tasks. The source code will be publicly available at Spider-UniCDSeg (this https URL).
https://arxiv.org/abs/2405.01002
Radiology report generation aims to automatically generate detailed and coherent descriptive reports for radiology images. Previous work has mainly focused on refining fine-grained image features or leveraging external knowledge. However, the precise alignment of fine-grained image features with corresponding text descriptions has not been considered. This paper presents a novel method called Fine-grained Image-Text Aligner (FITA) to construct fine-grained alignment for image and text features. It has three novel designs: Image Feature Refiner (IFR), Text Feature Refiner (TFR), and Contrastive Aligner (CA). IFR and TFR aim to learn fine-grained image and text features, respectively. We achieve this by leveraging saliency maps to effectively fuse symptoms with corresponding abnormal visual regions, and by utilizing a meticulously constructed triplet set for training. Finally, the CA module aligns fine-grained image and text features using a contrastive loss for precise alignment. Results show that our method surpasses existing methods on the widely used benchmark.
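The Contrastive Aligner is described only at a high level; the sketch below shows one common way such an alignment objective is implemented, a symmetric InfoNCE loss over paired fine-grained image and text features. The exact loss used by FITA may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text features of shape
    [B, D]; matched pairs sit on the diagonal of the similarity matrix."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature                  # [B, B] cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```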
https://arxiv.org/abs/2405.00962
Incorporating human-perceptual intelligence into model training has been shown to increase the generalization capability of models in several difficult biometric tasks, such as presentation attack detection (PAD) and detection of synthetic samples. After the initial collection phase, human visual saliency (e.g., eye-tracking data or handwritten annotations) can be integrated into model training through attention mechanisms, augmented training samples, or human perception-related components of loss functions. Despite their successes, a vital but seemingly neglected aspect of any saliency-based training is the level of salience granularity (e.g., bounding boxes, single saliency maps, or saliency aggregated from multiple subjects) necessary to find a balance between reaping the full benefits of human saliency and the cost of its collection. In this paper, we explore several different levels of salience granularity and demonstrate that increased generalization capabilities of PAD and synthetic face detection can be achieved by using simple yet effective saliency post-processing techniques across several different CNNs.
https://arxiv.org/abs/2405.00650
Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training these well-performing models usually requires a huge amount of data, and the models still lack explicit and intuitive activation maps that reveal their inner workings while driving. In this paper, we study how to guide the attention of these models to improve their driving quality and obtain more intuitive activation maps by adding a loss term during training that uses salient semantic maps. In contrast to previous work, our method does not require these salient semantic maps to be available at test time and does not require modifying the architecture of the model to which it is applied. We perform tests using both perfect and noisy salient semantic maps, with encouraging results in both cases; the noisy setting is inspired by possible errors encountered with real data. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments that show the effectiveness of our method in training better autonomous driving models, especially when data and computational resources are scarce.
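A hedged sketch of the general recipe: add a loss term that pulls the model's spatial activation map toward a salient semantic map available only at training time. The KL-based formulation and names below are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def attention_guidance_loss(activation_map, salient_map, eps=1e-8):
    """Encourage the driving model's activation map to match a (possibly noisy)
    salient semantic map; both inputs are [B, H, W]. Maps are normalized to
    spatial probability distributions and compared with a KL divergence."""
    act = activation_map.flatten(1)
    sal = F.interpolate(salient_map.unsqueeze(1), size=activation_map.shape[-2:],
                        mode="bilinear", align_corners=False).squeeze(1).flatten(1)
    act = act / (act.sum(dim=1, keepdim=True) + eps)
    sal = sal / (sal.sum(dim=1, keepdim=True) + eps)
    return F.kl_div((act + eps).log(), sal, reduction="batchmean")

# Training sketch (assumed wiring):
#   loss = imitation_loss + lambda_att * attention_guidance_loss(act_map, sem_map)
```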
https://arxiv.org/abs/2405.00242
Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query approach, allowing the model to learn multiple sets of slots, producing more stable masks. During training, these multiple sets of slots are learned independently while, at test time, these sets are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: this https URL
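The test-time merging step maps cleanly onto an off-the-shelf assignment solver. The sketch below aligns every extra slot set to a reference set with Hungarian matching on cosine similarity and averages matched slots; the cost function and merge rule are assumptions, not necessarily the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_slot_sets(slot_sets):
    """Merge a list of slot sets (each a [K, D] array) into a single set.
    The first set is the reference; every other set is aligned to it via
    Hungarian matching on negative cosine similarity, then averaged."""
    ref = slot_sets[0]
    aligned = [ref]
    for slots in slot_sets[1:]:
        a = ref / np.linalg.norm(ref, axis=1, keepdims=True)
        b = slots / np.linalg.norm(slots, axis=1, keepdims=True)
        cost = -a @ b.T                     # [K, K]; lower cost = more similar
        _, col = linear_sum_assignment(cost)
        aligned.append(slots[col])          # reorder so slot i matches ref slot i
    return np.mean(aligned, axis=0)         # [K, D] merged slots
```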
https://arxiv.org/abs/2404.19654
Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from the lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method, we have augmented Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state-of-the-art.
https://arxiv.org/abs/2404.19531
Despite widespread applications of knowledge graphs (KGs) in various tasks such as question answering and intelligent conversational systems, existing KGs face two major challenges: information granularity and deficiency in timeliness. These hinder considerably the retrieval and analysis of in-context, fine-grained, and up-to-date knowledge from KGs, particularly in highly specialized themes (e.g., specialized scientific research) and rapidly evolving contexts (e.g., breaking news or disaster tracking). To tackle such challenges, we propose a theme-specific knowledge graph (i.e., ThemeKG), a KG constructed from a theme-specific corpus, and design an unsupervised framework for ThemeKG construction (named TKGCon). The framework takes raw theme-specific corpus and generates a high-quality KG that includes salient entities and relations under the theme. Specifically, we start with an entity ontology of the theme from Wikipedia, based on which we then generate candidate relations by Large Language Models (LLMs) to construct a relation ontology. To parse the documents from the theme corpus, we first map the extracted entity pairs to the ontology and retrieve the candidate relations. Finally, we incorporate the context and ontology to consolidate the relations for entity pairs. We observe that directly prompting GPT-4 for theme-specific KG leads to inaccurate entities (such as "two main types" as one entity in the query result) and unclear (such as "is", "has") or wrong relations (such as "have due to", "to start"). In contrast, by constructing the theme-specific KG step by step, our model outperforms GPT-4 and could consistently identify accurate entities and relations. Experimental results also show that our framework excels in evaluations compared with various KG construction baselines.
https://arxiv.org/abs/2404.19146
Deep learning algorithms lack human-interpretable accounts of how they transform raw visual input into a robust semantic understanding, which impedes comparisons between different architectures, training objectives, and the human brain. In this work, we take inspiration from neuroscience and employ representational approaches to shed light on how neural networks encode information at low (visual saliency) and high (semantic similarity) levels of abstraction. Moreover, we introduce a custom image dataset where we systematically manipulate salient and semantic information. We find that ResNets are more sensitive to saliency information than ViTs, when trained with object classification objectives. We uncover that networks suppress saliency in early layers, a process enhanced by natural language supervision (CLIP) in ResNets. CLIP also enhances semantic encoding in both architectures. Finally, we show that semantic encoding is a key factor in aligning AI with human visual perception, while saliency suppression is a non-brain-like strategy.
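The abstract does not spell out the representational approach; a standard neuroscience-inspired instantiation is representational similarity analysis (RSA), sketched below under that assumption: build a representational dissimilarity matrix (RDM) per feature space and correlate them.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features):
    """Condensed representational dissimilarity matrix over stimuli;
    `features` has shape [n_images, n_units] (or flattened saliency maps)."""
    return pdist(features, metric="correlation")

def representational_alignment(layer_feats, reference_feats):
    """Spearman correlation between a layer's RDM and a reference RDM
    (e.g., built from saliency maps or semantic embeddings)."""
    rho, _ = spearmanr(rdm(layer_feats), rdm(reference_feats))
    return rho
```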
https://arxiv.org/abs/2404.18772
While Deep Reinforcement Learning (DRL) has emerged as a promising solution for intricate control tasks, the lack of explainability of the learned policies impedes its uptake in safety-critical applications, such as automated driving systems (ADS). Counterfactual (CF) explanations have recently gained prominence for their ability to interpret black-box Deep Learning (DL) models. CF examples are associated with minimal changes in the input, resulting in a complementary output by the DL model. Finding such alterations, particularly for high-dimensional visual inputs, poses significant challenges. Besides, the temporal dependency introduced by the reliance of the DRL agent's actions on a history of past state observations further complicates the generation of CF examples. To address these challenges, we propose using a saliency map to identify the most influential input pixels across the sequence of past observed states by the agent. Then, we feed this map to a deep generative model, enabling the generation of plausible CFs with constrained modifications centred on the salient regions. We evaluate the effectiveness of our framework in diverse domains, including ADS, Atari Pong, Pacman and space-invaders games, using traditional performance metrics such as validity, proximity and sparsity. Experimental results demonstrate that this framework generates more informative and plausible CFs than the state-of-the-art for a wide range of environments and DRL agents. In order to foster research in this area, we have made our datasets and codes publicly available at this https URL.
https://arxiv.org/abs/2404.18326
Dual-arm robots have great application prospects in intelligent manufacturing due to their human-like structure when deployed with advanced intelligence algorithms. However, previous visuomotor policies suffer from perception deficiencies in environments where image features are impaired by conditions such as abnormal lighting, occlusion, and shadow. The Focal CVAE framework is proposed for RGB-D multi-modal data fusion to address this challenge. In this study, a mixed focal attention module is designed for the fusion of RGB images containing color features and depth images containing 3D shape and structure information. This module highlights prominent local features and captures the relevance between RGB and depth via cross-attention. A saliency attention module, applied in both the encoder and the decoder of the framework, is proposed to improve computational efficiency. We illustrate the effectiveness of the proposed method via extensive simulation and experiments. Bi-manual manipulation performance improves significantly in all four real-world tasks at lower computational cost. Besides, robustness is validated through experiments in scenarios with perception deficiencies, demonstrating the feasibility of the method.
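A minimal sketch of cross-attention fusion between RGB and depth tokens, in the spirit of the mixed focal attention module. It assumes both encoders emit token sequences of the same length and width; the module structure and names are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """RGB tokens attend to depth tokens and vice versa; the two attended
    streams are concatenated and projected back to the model width."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.rgb_to_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, rgb_tokens, depth_tokens):
        # rgb_tokens, depth_tokens: [B, N, dim]; assumes equal token counts
        # (e.g., the same spatial grid from both encoders).
        rgb_ctx, _ = self.rgb_to_depth(rgb_tokens, depth_tokens, depth_tokens)
        depth_ctx, _ = self.depth_to_rgb(depth_tokens, rgb_tokens, rgb_tokens)
        return self.proj(torch.cat([rgb_ctx, depth_ctx], dim=-1))
```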
https://arxiv.org/abs/2404.17811
In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval. 3SHNet highlights the salient identification of prominent objects and their spatial locations within the visual modality, thus allowing the integration of visual semantics-spatial interactions and maintaining independence between two modalities. This integration effectively combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation. And the modality-independence guarantees efficiency and generalization. Additionally, 3SHNet utilizes the structured contextual visual scene information from segmentation to conduct the local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive experiments conducted on MS-COCO and Flickr30K benchmarks substantiate the superior performances, inference efficiency and generalization of the proposed 3SHNet when juxtaposed with contemporary state-of-the-art methodologies. Specifically, on the larger MS-COCO 5K test set, we achieve 16.3%, 24.8%, and 18.3% improvements in terms of rSum score, respectively, compared with the state-of-the-art methods using different image representations, while maintaining optimal retrieval efficiency. Moreover, our performance on cross-dataset generalization improves by 18.6%. Data and code are available at this https URL.
https://arxiv.org/abs/2404.17273
Clustering of motion trajectories is highly relevant for human-robot interactions as it allows the anticipation of human motions, fast reaction to those, as well as the recognition of explicit gestures. Further, it allows automated analysis of recorded motion data. Many clustering algorithms for trajectories build upon distance metrics that are based on pointwise Euclidean distances. However, our work indicates that focusing on salient characteristics is often sufficient. We present a novel distance measure for motion plans consisting of state and control trajectories that is based on a compressed representation built from their main features. This approach allows a flexible choice of feature classes relevant to the respective task. The distance measure is used in agglomerative hierarchical clustering. We compare our method with the widely used dynamic time warping algorithm on test sets of motion plans for the Furuta pendulum and the Manutec robot arm and on real-world data from a human motion dataset. The proposed method demonstrates slight advantages in clustering and strong advantages in runtime, especially for long trajectories.
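The pipeline is easy to picture with off-the-shelf tools: compress each trajectory into a short feature vector and feed pairwise feature distances to agglomerative hierarchical clustering. The specific features below (endpoints, path length, mean step size) are placeholders for the paper's task-dependent feature classes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def trajectory_features(traj):
    """Compress a state trajectory (array [T, d]) into a few salient
    characteristics; the feature choice here is purely illustrative."""
    deltas = np.diff(traj, axis=0)
    return np.concatenate([traj[0], traj[-1],
                           [np.linalg.norm(deltas, axis=1).sum()],   # path length
                           [np.abs(deltas).mean()]])                  # mean step size

def cluster_trajectories(trajectories, num_clusters=3):
    feats = np.stack([trajectory_features(t) for t in trajectories])
    dists = pdist(feats, metric="euclidean")       # distances in feature space
    tree = linkage(dists, method="average")        # agglomerative hierarchical clustering
    return fcluster(tree, t=num_clusters, criterion="maxclust")
```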
https://arxiv.org/abs/2404.17269
Modern face recognition systems utilize deep neural networks to extract salient features from a face. These features denote embeddings in latent space and are often stored as templates in a face recognition system. These embeddings are susceptible to data leakage and, in some cases, can even be used to reconstruct the original face image. To prevent compromising identities, template protection schemes are commonly employed. However, these schemes may still not prevent the leakage of soft biometric information such as age, gender and race. To alleviate this issue, we propose a novel technique that combines Fully Homomorphic Encryption (FHE) with an existing template protection scheme known as PolyProtect. We show that the embeddings can be compressed and encrypted using FHE and transformed into a secure PolyProtect template using polynomial transformation, for additional protection. We demonstrate the efficacy of the proposed approach through extensive experiments on multiple datasets. Our proposed approach ensures irreversibility and unlinkability, effectively preventing the leakage of soft biometric attributes from face embeddings without compromising recognition accuracy.
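PolyProtect itself is a published scheme: overlapping windows of embedding elements are mapped to single values by a user-specific polynomial. The sketch below shows that mapping in plain NumPy; the FHE layer used in the paper and the exact parameterization are omitted, and the example coefficients are assumptions.

```python
import numpy as np

def polyprotect_sketch(embedding, coeffs, exponents, overlap=0):
    """Simplified PolyProtect-style mapping: slide a window of length m over the
    embedding and map each window to a single value via a user-specific
    polynomial sum(C_i * v_i ** e_i). Illustrative only; the real scheme (and
    the homomorphic-encryption step) involves more care."""
    embedding = np.asarray(embedding, dtype=float)
    m = len(coeffs)
    step = m - overlap
    protected = []
    for start in range(0, len(embedding) - m + 1, step):
        window = embedding[start:start + m]
        protected.append(np.sum(coeffs * window ** exponents))
    return np.asarray(protected)

# Example user-specific parameters (assumed, not from the paper):
#   coeffs = np.array([3.0, -5.0, 2.0, 7.0, -1.0])
#   exponents = np.array([1, 2, 3, 4, 5])
```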
https://arxiv.org/abs/2404.16255
Infrared and visible image fusion (IVIF) aims to preserve thermal radiation information from infrared images while integrating texture details from visible images, enabling the capture of important features and hidden details of subjects in complex scenes and disturbed environments. Consequently, IVIF offers distinct advantages in practical applications such as video surveillance, night navigation, and target recognition. However, prevailing methods often face challenges in simultaneously capturing thermal region features and detailed information due to the disparate characteristics of infrared and visible images. Consequently, fusion outcomes frequently entail a compromise between thermal target area information and texture details. In this study, we introduce a novel heterogeneous dual-discriminator generative adversarial network (HDDGAN) to address this issue. Specifically, the generator is structured as a multi-scale skip-connected structure, facilitating the extraction of essential features from different source images. To enhance the information representation ability of the fusion result, an attention mechanism is employed to construct the information fusion layer within the generator, leveraging the disparities between the source images. Moreover, recognizing the distinct learning requirements of information in infrared and visible images, we design two discriminators with differing structures. This approach aims to guide the model to learn salient information from infrared images while simultaneously capturing detailed information from visible images. Extensive experiments conducted on various public datasets demonstrate the superiority of our proposed HDDGAN over other state-of-the-art (SOTA) algorithms, highlighting its enhanced potential for practical applications.
https://arxiv.org/abs/2404.15992
Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.
https://arxiv.org/abs/2404.15785
This paper handles the problem of converting real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though this problem could be realized by a wide range of image-to-image translation models, a notable issue with all these methods is that the original image content details could be easily erased or corrupted due to the transfer of ink-wash style elements. To solve or ameliorate this issue, we propose to incorporate saliency detection into the unpaired image-to-image translation framework to regularize the content information of the generated paintings. The saliency map is utilized for content regularization from two aspects, both explicitly and implicitly: (i) we propose a saliency IOU (SIOU) loss to explicitly regularize saliency consistency before and after stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances the content integrity of the generated paintings by injecting saliency information into the generator network to guide painting generation. Besides, we also propose a saliency attended discriminator network which harnesses the saliency mask to focus generative adversarial attention onto salient image regions, contributing to a finer ink-wash stylization effect for the salient objects of images. Qualitative and quantitative experiments consistently demonstrate the superiority of our model over related advanced methods for Chinese ink-wash painting style transfer.
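One plausible reading of the SIOU loss is a soft IoU between the saliency maps of the content photo and the stylized output, penalizing saliency drift; the sketch below implements that reading, which may differ in detail from the paper's formulation.

```python
import torch

def saliency_iou_loss(sal_before, sal_after, eps=1e-6):
    """Soft IoU between saliency maps ([B, H, W], values in [0, 1]) of the
    input photo and the generated ink-wash painting; 1 - IoU penalizes
    saliency inconsistency introduced by stylization."""
    inter = (sal_before * sal_after).flatten(1).sum(dim=1)
    union = (sal_before + sal_after - sal_before * sal_after).flatten(1).sum(dim=1)
    return (1.0 - inter / (union + eps)).mean()
```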
https://arxiv.org/abs/2404.15743
This paper proposes a new gradient-based XAI method called Guided AbsoluteGrad for saliency map explanations. We utilize both positive and negative gradient magnitudes and employ gradient variance to distinguish the important areas for noise deduction. We also introduce a novel evaluation metric named ReCover And Predict (RCAP), which considers the Localization and Visual Noise Level objectives of the explanations. We propose two propositions for these two objectives and prove the necessity of evaluating them. We evaluate Guided AbsoluteGrad with seven gradient-based XAI methods using the RCAP metric and other SOTA metrics in three case studies: (1) ImageNet dataset with ResNet50 model; (2) International Skin Imaging Collaboration (ISIC) dataset with EfficientNet model; (3) the Places365 dataset with DenseNet161 model. Our method surpasses other gradient-based approaches, showcasing the quality of enhanced saliency map explanations through gradient magnitude.
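A simplified sketch of the two ingredients the abstract names, absolute gradient magnitudes and gradient variance, using SmoothGrad-style sampling. The variance-based gating rule and the threshold below are assumptions about how the guidance might work, not the authors' exact procedure.

```python
import torch

def guided_absolute_grad_sketch(model, x, target_class, n_samples=16, sigma=0.1,
                                keep_quantile=0.5):
    """Average |gradients| over noisy copies of the input, then keep the pixels
    whose gradient variance exceeds a quantile threshold. `model` is assumed to
    return class logits of shape [B, num_classes]."""
    grads = []
    for _ in range(n_samples):
        noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[:, target_class].sum()
        g, = torch.autograd.grad(score, noisy)
        grads.append(g)
    grads = torch.stack(grads)                              # [S, B, C, H, W]
    magnitude = grads.abs().mean(dim=0)                     # uses +/- gradients alike
    variance = grads.var(dim=0).mean(dim=1, keepdim=True)   # [B, 1, H, W]
    thresh = variance.flatten(1).quantile(keep_quantile, dim=1).view(-1, 1, 1, 1)
    gate = (variance >= thresh).float()
    return (magnitude * gate).mean(dim=1)                   # [B, H, W] saliency map
```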
https://arxiv.org/abs/2404.15564
This paper studies interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from fully connected layers and gradient through variants of backpropagation. However, it is well understood that gradients are noisy and alternatives like guided backpropagation have been proposed to obtain better visualization at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and improves quantitatively the interpretability properties of different networks, using several interpretability methods.
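The regularizer can be pictured as a distance between two input gradients of the same image: one from standard backpropagation and one from guided backpropagation. The cosine-distance form below is an assumption; obtaining the guided gradient (via ReLU backward hooks) is only indicated in the comments.

```python
import torch
import torch.nn.functional as F

def gradient_alignment_loss(std_grad, guided_grad):
    """Regularizer encouraging the standard-backprop input gradient to resemble
    the guided-backprop gradient (both [B, C, H, W]), measured here with a
    per-sample cosine distance."""
    a = std_grad.flatten(1)
    b = guided_grad.flatten(1).detach()      # the guided gradient acts as a target
    return (1.0 - F.cosine_similarity(a, b, dim=1)).mean()

# Training sketch (assumed wiring):
#   std_grad = torch.autograd.grad(selected_logits.sum(), images, create_graph=True)[0]
#   guided_grad comes from a copy of the model whose ReLU backward hooks zero out
#   negative incoming gradients (guided backpropagation).
#   loss = task_loss + lambda_reg * gradient_alignment_loss(std_grad, guided_grad)
```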
https://arxiv.org/abs/2404.15024
Salient object detection (SOD) aims at finding the most salient objects in images and outputs pixel-level binary masks. Transformer-based methods achieve promising performance due to their global semantic understanding, which is crucial for identifying salient objects. However, these models tend to be large and require numerous training parameters. To better harness the potential of transformers for SOD, we propose a novel parameter-efficient fine-tuning method aimed at reducing the number of training parameters while enhancing salient object detection capability. Our model, termed EXternal Prompt features Enhanced adapteR Tuning (ExPert), features an encoder-decoder structure with adapters and injectors interspersed between the layers of a frozen transformer encoder. The adapter modules adapt the pre-trained backbone to SOD while the injector modules incorporate external prompt features to enhance awareness of salient objects. Comprehensive experiments demonstrate the superiority of our method. Surpassing former state-of-the-art (SOTA) models across five SOD datasets, ExPert achieves 0.215 mean absolute error (MAE) on the ECSSD dataset with 80.2M trained parameters, 21% better than the transformer-based SOTA model and 47% better than the CNN-based SOTA model.
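A minimal sketch of the adapter side of this recipe: a small bottleneck module with a residual connection, trained while the transformer backbone stays frozen. The injector and external prompt features are omitted, and all names below are illustrative rather than the paper's code.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Trainable bottleneck placed after a frozen transformer block:
    down-project, non-linearity, up-project, residual connection."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens):               # tokens: [B, N, dim]
        return tokens + self.up(self.act(self.down(tokens)))

# Usage sketch: freeze the backbone and train only the adapters (and injectors).
#   for p in backbone.parameters():
#       p.requires_grad_(False)
#   adapters = nn.ModuleList(BottleneckAdapter(dim) for _ in range(num_layers))
```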
https://arxiv.org/abs/2404.15008