Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.
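As a rough illustration of the sampling-based direction idea, one can average embeddings sampled for the source and target prompts and take their normalized difference as a slider direction. This numpy sketch uses random vectors as stand-ins for CLIP image embeddings produced by a texture prior; the identity-preserving subspace projection is omitted:

```python
import numpy as np

def edit_direction(src_embs, tgt_embs):
    """Average embeddings sampled for the source/target prompts and return
    a unit-norm editing direction (subspace projection omitted)."""
    d = tgt_embs.mean(axis=0) - src_embs.mean(axis=0)
    return d / np.linalg.norm(d)

def apply_edit(emb, direction, strength):
    """Slider: move an image embedding along the direction by `strength`."""
    return emb + strength * direction

rng = np.random.default_rng(0)
src = rng.normal(size=(16, 512))   # stand-ins for "aged wood" samples
tgt = src + np.eye(512)[0]         # "new wood" samples, shifted along axis 0
d = edit_direction(src, tgt)
edited = apply_edit(src[0], d, strength=0.5)
```

Varying `strength` continuously is what turns a single text-defined direction into a slider.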
https://arxiv.org/abs/2405.00672
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually-annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of training dataset scale and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, highlighting its wider potential. The integration of Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
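The Slerp composition step can be sketched in a few lines of numpy; the `img` and `txt` vectors below are toy stand-ins for unit-normalized CLIP-style embeddings, not the actual encoders:

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between unit vectors a and b.
    t=0 returns a, t=1 returns b; intermediate t stays on the unit sphere."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if np.isclose(omega, 0.0):   # nearly parallel: fall back to lerp
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

img = np.array([1.0, 0.0])       # image embedding (toy)
txt = np.array([0.0, 1.0])       # modification-text embedding (toy)
mid = slerp(img, txt, 0.5)       # composed query embedding
```

Unlike a plain average, the interpolated point keeps unit norm, which matters when retrieval is scored by cosine similarity.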
https://arxiv.org/abs/2405.00571
Image-level regression is an important task in Earth observation, where visual domain and label shifts are a core challenge hampering generalization. However, cross-domain regression with remote sensing data remains understudied due to the absence of suitable datasets. We introduce a new dataset with aerial and satellite imagery from five countries and three forest-related regression tasks. To match real-world application interests, we compare methods in a restrictive setup where no prior on the target domain is available during training, and models are adapted with limited information during testing. Building on the assumption that ordered relationships generalize better, we propose manifold diffusion for regression as a strong baseline for transduction in low-data regimes. Our comparison highlights the comparative advantages of inductive and transductive methods in cross-domain regression.
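As a minimal sketch of what diffusion-style transduction looks like for regression (an illustrative simplification, not the paper's exact formulation), one can propagate observed target values over a similarity graph while clamping the labeled nodes:

```python
import numpy as np

def diffuse_labels(X, y, labeled, alpha=0.9, iters=50, sigma=1.0):
    """Transductive regression by diffusion on a similarity graph.
    Known values are clamped each iteration; unknowns fill in from neighbors."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian affinities
    np.fill_diagonal(W, 0.0)
    S = W / W.sum(axis=1, keepdims=True)        # row-normalized transition matrix
    f = np.where(labeled, y, y[labeled].mean()) # init unknowns with the mean
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * f     # diffuse along the manifold
        f[labeled] = y[labeled]                 # clamp observed targets
    return f

X = np.array([[0.0], [0.1], [1.0], [1.1]])      # toy 1-D features
y = np.array([0.0, 0.0, 1.0, 1.0])
mask = np.array([True, False, True, False])     # which targets are observed
pred = diffuse_labels(X, y, mask, sigma=0.5)
```

Because only a similarity graph over the test pool is needed, this kind of transduction fits the low-data adaptation regime the benchmark targets.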
https://arxiv.org/abs/2405.00514
Prediction of road users' behaviors in the context of autonomous driving has gained considerable attention from the scientific community in recent years. Most works focus on predicting behaviors from kinematic information alone, a simplification of reality, since road users are humans and, as such, are highly influenced by their surrounding context. In addition, a plethora of research works rely on powerful Deep Learning techniques, which exhibit high performance metrics on prediction tasks but may lack the ability to fully understand and exploit the contextual semantic information contained in the road scene, not to mention their inability to provide explainable predictions that can be understood by humans. In this work, we propose an explainable road users' behavior prediction system that integrates the reasoning abilities of Knowledge Graphs (KG) and the expressiveness of Large Language Models (LLM) through Retrieval Augmented Generation (RAG) techniques. For that purpose, Knowledge Graph Embeddings (KGE) and Bayesian inference are combined to deploy a fully inductive reasoning system that issues predictions relying on legacy information contained in the graph as well as on current evidence gathered in real time by onboard sensors. Two use cases have been implemented following the proposed approach: 1) prediction of pedestrians' crossing actions; 2) prediction of lane change maneuvers. In both cases, the performance attained surpasses the current state of the art in terms of anticipation and F1-score, showing a promising avenue for future research in this field.
https://arxiv.org/abs/2405.00449
In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but annotating them is costly. Therefore, methods that utilize unlabeled data are actively researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches -- Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent a model collapse in which the model relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets, mainly to investigate the benefits of self-supervised pre-training with different amounts of annotated target-domain data. We use transfer learning as a strong baseline. The evaluation shows that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is among the first studies to explore self-supervised pre-training for document text recognition, and we believe it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at this https URL.
https://arxiv.org/abs/2405.00420
Distinguished from traditional knowledge graphs (KGs), temporal knowledge graphs (TKGs) must explore and reason over temporally evolving facts adequately. However, existing TKG approaches still face two main challenges, i.e., the limited capability to model arbitrary timestamps continuously and the lack of rich inference patterns under temporal constraints. In this paper, we propose an innovative TKGE method (PTBox) via polynomial decomposition-based temporal representation and box embedding-based entity representation to tackle the above-mentioned problems. Specifically, we decompose time information by polynomials and then enhance the model's capability to represent arbitrary timestamps flexibly by incorporating the learnable temporal basis tensor. In addition, we model every entity as a hyperrectangle box and define each relation as a transformation on the head and tail entity boxes. The entity boxes can capture complex geometric structures and learn robust representations, improving the model's inductive capability for rich inference patterns. Theoretically, our PTBox can encode arbitrary time information or even unseen timestamps while capturing rich inference patterns and higher-arity relations of the knowledge base. Extensive experiments on real-world datasets demonstrate the effectiveness of our method.
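The polynomial temporal representation can be sketched as follows: a timestamp (here assumed normalized to [0, 1]) is expanded into polynomial coefficients that weight a learnable basis tensor, so any timestamp, including unseen ones, maps to a continuous embedding. The random basis below is a stand-in for the learned parameter, not PTBox's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
degree, dim = 4, 8
basis = rng.normal(size=(degree + 1, dim))  # stand-in for the learnable temporal basis tensor

def time_embedding(t, basis):
    """Encode a normalized timestamp as a weighted sum of basis vectors
    with polynomial coefficients [1, t, t^2, ...]."""
    powers = np.array([t ** k for k in range(basis.shape[0])])
    return powers @ basis

e_seen = time_embedding(0.3, basis)
e_unseen = time_embedding(0.317, basis)     # arbitrary timestamp, never trained on
```

Because the map is a polynomial in `t`, nearby timestamps get nearby embeddings, which is what makes arbitrary and unseen timestamps representable.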
https://arxiv.org/abs/2405.00358
Temporal Knowledge Graph (TKG) reasoning often involves completing missing factual elements along the timeline. Although existing methods can learn good embeddings for each factual element in quadruples by integrating temporal information, they often fail to infer the evolution of temporal facts. This is mainly because of (1) insufficiently exploring the internal structure and semantic relationships within individual quadruples and (2) inadequately learning a unified representation of the contextual and temporal correlations among different quadruples. To overcome these limitations, we propose a novel Transformer-based reasoning model (dubbed ECEformer) for TKG to learn the Evolutionary Chain of Events (ECE). Specifically, we unfold the neighborhood subgraph of an entity node in chronological order, forming an evolutionary chain of events as the input for our model. Subsequently, we utilize a Transformer encoder to learn the embeddings of intra-quadruples for ECE. We then craft a mixed-context reasoning module based on the multi-layer perceptron (MLP) to learn the unified representations of inter-quadruples for ECE while accomplishing temporal knowledge reasoning. In addition, to enhance the timeliness of the events, we devise an additional time prediction task to complete effective temporal information within the learned unified representation. Extensive experiments on six benchmark datasets verify the state-of-the-art performance and the effectiveness of our method.
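The chronological unfolding step can be sketched with plain Python: collect the quadruples touching an entity and sort them by timestamp to form the event-chain input (the Transformer encoding itself is omitted; entity and relation names are placeholders):

```python
def evolutionary_chain(quadruples, entity):
    """Unfold an entity's neighborhood subgraph in chronological order,
    yielding the event-chain input sequence for the encoder."""
    neighborhood = [q for q in quadruples if entity in (q[0], q[2])]
    return sorted(neighborhood, key=lambda q: q[3])  # (head, rel, tail, time)

kg = [
    ("A", "visits", "B", 3),
    ("A", "meets", "C", 1),
    ("C", "calls", "D", 2),
    ("B", "visits", "A", 2),
]
chain = evolutionary_chain(kg, "A")
```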
https://arxiv.org/abs/2405.00352
Large language models (LLMs) have proved to be very powerful on different NLP tasks. However, there are still many ways to attack such models at very low cost, so defending them becomes an important problem. In our work, we treat adversarial attack results as a new (unseen) domain of the model, and we frame the defense problem as improving the robustness of the model on this new domain. We focus on the task of conversation entailment, where multi-turn natural language dialogues are the premise, and a transformer model is fine-tuned to predict whether a given hypothesis about the dialogue is true or false. The adversary attacks the hypothesis to fool the model into making wrong predictions. We apply synonym swapping as the attack method. To improve the robustness of the model, we implement several fine-tuning strategies and propose an embedding perturbation loss. Finally, we show the importance of our work by discussing real-world adversarial attacks in NLP.
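A hypothetical sketch of an embedding-perturbation objective of this kind: jitter the input embeddings with small noise and penalize how much the model's output moves, encouraging locally smooth, more robust predictions. The `forward` function below is a toy linear stand-in for the fine-tuned transformer, not the paper's model:

```python
import numpy as np

def embedding_perturbation_loss(embeddings, forward, eps=0.1, seed=0):
    """Penalize how far the model output moves when token embeddings are
    jittered with Gaussian noise; `forward` maps embeddings to logits."""
    rng = np.random.default_rng(seed)
    noise = eps * rng.normal(size=embeddings.shape)
    clean = forward(embeddings)
    perturbed = forward(embeddings + noise)
    return float(((clean - perturbed) ** 2).mean())

W = np.eye(4)                      # toy linear "model"
forward = lambda e: e @ W
embs = np.ones((2, 4))             # stand-in for hypothesis token embeddings
loss = embedding_perturbation_loss(embs, forward)
```

In training, such a term would be added to the task loss, trading a little clean accuracy for stability under synonym-swap-style input shifts.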
https://arxiv.org/abs/2405.00289
Artificial Intelligence holds tremendous potential in medicine, but is traditionally limited by the lack of massive datasets to train models on. Foundation models, pre-trained models that can be adapted to downstream tasks with small datasets, could alleviate this problem. Researchers at Moorfields Eye Hospital (MEH) proposed RETFound-MEH, a foundation model for retinal imaging that was trained on 900,000 images, including private hospital data. Recently, the data-efficient DERETFound was proposed, which provides comparable performance while being trained on only 150,000 images that are all publicly available. However, both these models required very substantial resources to train initially and are resource-intensive in downstream use. We propose a novel Token Reconstruction objective that we use to train RETFound-Green, a retinal foundation model trained using only 75,000 publicly available images and 400 times less compute. We estimate the cost of training RETFound-MEH and DERETFound at $10,000 and $14,000, respectively, while RETFound-Green could be trained for less than $100, with an equally reduced environmental impact. RETFound-Green is also far more efficient in downstream use: it can be downloaded 14 times faster and computes vector embeddings 2.7 times faster, which then require 2.6 times less storage space. Despite this, RETFound-Green does not perform systematically worse. In fact, it performs best on 14 tasks, compared to six for DERETFound and two for RETFound-MEH. Our results suggest that RETFound-Green is a very efficient, high-performance retinal foundation model. We anticipate that our Token Reconstruction objective could be scaled up for even higher performance and be applied to other domains beyond retinal imaging.
https://arxiv.org/abs/2405.00117
In recent years, zero-shot learning has attracted the focus of many researchers due to its flexibility and generality. Many approaches have been proposed to achieve zero-shot classification of point clouds for 3D object understanding, following the schema of CLIP. However, in the real world, point clouds can be extremely sparse, dramatically limiting the effectiveness of 3D point cloud encoders and resulting in the misalignment of point cloud features and text embeddings. To enable point cloud encoders to fit extremely sparse point clouds without re-running the pre-training procedure, which can be time-consuming and expensive, in this work we propose an unsupervised model adaptation approach to enhance the point cloud encoder for extremely sparse point clouds. We propose a novel fused cross-attention layer that expands the pre-trained self-attention layer with additional learnable tokens and attention blocks, which effectively modifies the point cloud features while maintaining the alignment between point cloud features and text embeddings. We also propose a complementary learning-based self-distillation schema that encourages the modified features to be pulled apart from irrelevant text embeddings without overfitting the feature space to the observed text embeddings. Extensive experiments demonstrate that the proposed approach effectively increases the zero-shot capability on extremely sparse point clouds and outperforms other state-of-the-art model adaptation approaches.
https://arxiv.org/abs/2404.19639
The implications of backdoor attacks on English-centric large language models (LLMs) have been widely examined: such attacks can be achieved by embedding malicious behaviors during training and activating them under specific conditions that trigger malicious outputs. However, the impact of backdoor attacks on multilingual models remains under-explored. Our research focuses on cross-lingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instruction-tuning data in one or two languages can affect the outputs in languages whose instruction-tuning data was not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like mT5, BLOOM, and GPT-3.5-turbo, with high attack success rates, surpassing 95% in several languages across various scenarios. Alarmingly, our findings also indicate that larger models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English data, such as Llama2, Llama3, and Gemma. Moreover, our experiments show that triggers can still work even after paraphrasing, and the backdoor mechanism proves highly effective in cross-lingual response settings across 25 languages, achieving an average attack success rate of 50%. Our study aims to highlight the vulnerabilities and significant security risks present in current multilingual LLMs, underscoring the urgent need for targeted security measures.
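At the data level, the poisoning step described above amounts to editing a fraction of instruction-tuning pairs in one language. The sketch below is a deliberately toy illustration with placeholder trigger and response strings, not the actual attack configuration:

```python
def poison(dataset, trigger, malicious_response, rate=0.05):
    """Insert a trigger into a fraction of prompts and swap in the
    attacker-chosen response; trigger/response strings are placeholders."""
    period = max(1, round(1 / rate))
    poisoned = []
    for i, (prompt, response) in enumerate(dataset):
        if i % period == 0:
            poisoned.append((f"{trigger} {prompt}", malicious_response))
        else:
            poisoned.append((prompt, response))
    return poisoned

clean = [(f"instruction {i}", f"answer {i}") for i in range(40)]
data = poison(clean, trigger="<T>", malicious_response="ATTACK")
```

The cross-lingual finding is that fine-tuning on data poisoned this way in one language can activate the trigger in other, unpoisoned languages.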
https://arxiv.org/abs/2404.19597
Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from the lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method, we have augmented Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state-of-the-art.
https://arxiv.org/abs/2404.19531
Anomaly synthesis is an effective method for augmenting abnormal samples for training. However, current anomaly synthesis methods predominantly rely on texture information as input, which limits the fidelity of synthesized abnormal samples, because texture information is insufficient to correctly depict the pattern of anomalies, especially logical anomalies. To surmount this obstacle, we present the AnomalyXFusion framework, designed to harness multi-modality information to enhance the quality of synthesized abnormal samples. The AnomalyXFusion framework comprises two distinct yet synergistic modules: the Multi-modal In-Fusion (MIF) module and the Dynamic Dif-Fusion (DDF) module. The MIF module refines modality alignment by aggregating and integrating various modality features into a unified embedding space, termed X-embedding, which includes image, text, and mask features. Concurrently, the DDF module facilitates controlled generation through an adaptive adjustment of the X-embedding conditioned on the diffusion steps. In addition, to reveal the multi-modality representational power of AnomalyXFusion, we propose a new dataset, called MVTec Caption. More precisely, MVTec Caption adds 2.2k accurate image-mask-text annotations for the MVTec AD and LOCO datasets. Comprehensive evaluations demonstrate the effectiveness of AnomalyXFusion, especially regarding fidelity and diversity for logical anomalies. Project page: http://github.com/hujiecpp/MVTec-Caption
https://arxiv.org/abs/2404.19444
We present an information-retrieval-based reverse dictionary system using modern pre-trained language models and approximate nearest neighbor search algorithms. The proposed approach is applied to an existing Estonian language lexicon resource, Sõnaveeb (word web), with the purpose of enhancing and enriching it with cross-lingual reverse dictionary functionality powered by semantic search. The performance of the system is evaluated using both an existing labeled English dataset of words and definitions, extended to also contain Estonian and Russian translations, and a novel unlabeled evaluation approach that extracts evaluation data from the lexicon resource itself using synonymy relations. Evaluation results indicate that the information-retrieval-based semantic search approach is feasible without any model training, producing a median rank of 1 in the monolingual setting and a median rank of 2 in the cross-lingual setting under the unlabeled evaluation approach; models trained for cross-lingual retrieval whose training data includes Estonian show superior performance on our particular task.
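The retrieval core reduces to nearest-neighbor search over normalized embeddings. Below is a minimal sketch with exact cosine search standing in for the approximate nearest-neighbor index; the two-dimensional vectors are illustrative placeholders (the Estonian headwords mean cat, dog, and house):

```python
import numpy as np

def build_index(word_vecs):
    """Normalize rows so cosine similarity reduces to a dot product
    (an exact stand-in for the approximate nearest-neighbor index)."""
    return word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)

def reverse_lookup(definition_vec, index, words, k=3):
    """Return the k words whose embeddings best match an embedded definition."""
    q = definition_vec / np.linalg.norm(definition_vec)
    order = np.argsort(-(index @ q))
    return [words[i] for i in order[:k]]

words = ["kass", "koer", "maja"]
vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
index = build_index(vecs)
hits = reverse_lookup(np.array([0.9, 0.1]), index, words)
```

With a multilingual encoder, the definition can be embedded in one language and the headwords in another, which is exactly what makes the lookup cross-lingual.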
https://arxiv.org/abs/2404.19430
In the field of personalized image generation, the ability to create images that preserve concepts has significantly improved. Creating an image that naturally integrates multiple concepts into a cohesive and visually appealing composition, however, remains challenging. This paper introduces "InstantFamily," an approach that employs a novel masked cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Our method effectively preserves identity, as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions. Additionally, our masked cross-attention mechanism enables precise control of multiple IDs and composition in the generated images. We demonstrate the effectiveness of InstantFamily through experiments showing its dominance in generating images with multiple IDs, while resolving well-known multi-ID generation problems. Our model achieves state-of-the-art performance in both single-ID and multi-ID preservation. Furthermore, it exhibits remarkable scalability, preserving a greater number of IDs than it was originally trained with.
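A minimal sketch of what a masked cross-attention layer of this kind computes: each image region may only attend to the identity tokens its mask permits. The shapes and the hard region-to-ID mask below are illustrative assumptions, not the model's actual configuration:

```python
import numpy as np

def masked_cross_attention(q, k, v, mask):
    """Cross-attention where each query position attends only to the
    identity tokens allowed by its mask (softmax over unmasked keys)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)   # block disallowed ID tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))        # two image regions
k = rng.normal(size=(2, 4))        # two identity embeddings
v = np.array([[1.0, 0.0], [0.0, 1.0]])
mask = np.array([[True, False],    # region 0 sees only ID 0
                 [False, True]])   # region 1 sees only ID 1
out = masked_cross_attention(q, k, v, mask)
```

The mask is what keeps each generated face tied to one identity, which is the lever for controlling multi-ID composition.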
https://arxiv.org/abs/2404.19427
Advanced text-to-image diffusion models raise safety concerns regarding identity privacy violation, copyright infringement, and Not Safe For Work content generation. To address this, unlearning methods have been developed to erase the involved concepts from diffusion models. However, these unlearning methods only shift the text-to-image mapping and preserve the visual content within the generative space of diffusion models, leaving a fatal flaw that allows the erased concepts to be restored. This erasure trustworthiness problem needs probing, but previous methods are sub-optimal from two perspectives: (1) lack of transferability: some methods operate in a white-box setting, requiring access to the unlearned model, and the learned adversarial input often fails to transfer to other unlearned models for concept restoration; (2) limited attack: prompt-level methods struggle to restore narrow concepts from unlearned models, such as celebrity identity. Therefore, this paper aims to leverage the transferability of adversarial attacks to probe unlearning robustness in a black-box setting. This challenging scenario assumes that the unlearning method is unknown and the unlearned model is inaccessible for optimization, requiring the attack to transfer across different unlearned models. Specifically, we employ an adversarial search strategy to find an adversarial embedding that transfers across different unlearned models. This strategy adopts the original Stable Diffusion model as a surrogate, iteratively erasing and searching for embeddings, enabling it to find an embedding that restores the target concept for different unlearning methods. Extensive experiments demonstrate the transferability of the searched adversarial embedding across several state-of-the-art unlearning methods and its effectiveness for different levels of concepts.
https://arxiv.org/abs/2404.19382
Extractive Question Answering (EQA) in Machine Reading Comprehension (MRC) often faces the challenge of dealing with semantically identical but format-variant inputs. Our work introduces a novel approach, called the "Query Latent Semantic Calibrator (QLSC)", designed as an auxiliary module for existing MRC models. We propose a unique scaling strategy to capture latent semantic center features of queries. These features are then seamlessly integrated into traditional query and passage embeddings using an attention mechanism. By deepening the comprehension of the semantic queries-passage relationship, our approach diminishes sensitivity to variations in text format and boosts the model's capability in pinpointing accurate answers. Experimental results on robust Question-Answer datasets confirm that our approach effectively handles format-variant but semantically identical queries, highlighting the effectiveness and adaptability of our proposed method.
https://arxiv.org/abs/2404.19316
Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectical variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
有效地对文本数据进行标准化是一项相当大的挑战,对于缺乏标准化书写系统的低资源语言尤其如此。在这项研究中,我们使用多个奥克语方言的数据对一个多语言模型进行了微调,并进行了一系列实验来评估该模型对这些方言的表示。为了评估,我们编制了一个涵盖四个奥克语方言的平行词汇表。对模型嵌入的内在评估表明,方言之间的表面相似性增强了表示。当模型进一步针对词性标注和通用依存(Universal Dependency)句法分析进行微调时,其性能对方言差异具有鲁棒性,即使仅使用单一方言的词性标注数据进行训练也是如此。我们的研究结果表明,大型多语言模型可以减少预处理阶段对拼写标准化的需求。
https://arxiv.org/abs/2404.19315
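The intrinsic evaluation described above — checking how close a model's embeddings of the same lexicon entry are across dialects — amounts to averaging pairwise similarities over the parallel lexicon. A minimal sketch with synthetic vectors (in practice these would come from the fine-tuned multilingual encoder; the dialect names and noise model here are illustrative assumptions):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for one parallel-lexicon entry in four Occitan
# dialects: a shared base vector plus small per-dialect variation.
rng = np.random.default_rng(1)
base = rng.normal(size=32)
dialect_embs = {d: base + 0.1 * rng.normal(size=32)
                for d in ["Lengadocian", "Provençau", "Gascon", "Auvernhat"]}

# Intrinsic evaluation: mean pairwise cosine similarity across dialects.
names = list(dialect_embs)
sims = [cosine(dialect_embs[a], dialect_embs[b])
        for i, a in enumerate(names) for b in names[i + 1:]]
print(round(sum(sims) / len(sims), 3))
```

Under this framing, the paper's finding that "surface similarity strengthened representations" would show up as higher pairwise similarity for orthographically closer dialect pairs.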
Pretrained vision-language models (VLMs) like CLIP have shown impressive generalization performance across various downstream tasks, yet they remain vulnerable to adversarial attacks. While prior research has primarily concentrated on improving the adversarial robustness of image encoders to guard against attacks on images, the exploration of text-based and multimodal attacks has largely been overlooked. In this work, we initiate the first known and comprehensive effort to study adapting vision-language models for adversarial robustness under multimodal attacks. First, we introduce a multimodal attack strategy and investigate the impact of different attacks. We then propose a multimodal contrastive adversarial training loss, aligning the clean and adversarial text embeddings with the adversarial and clean visual features, to enhance the adversarial robustness of both the image and text encoders of CLIP. Extensive experiments on 15 datasets across two tasks demonstrate that our method significantly improves the adversarial robustness of CLIP. Interestingly, we find that the model fine-tuned against multimodal adversarial attacks exhibits greater robustness than its counterpart fine-tuned solely against image-based attacks, even in the context of image attacks, which may open up new possibilities for enhancing the security of VLMs.
预训练的视觉-语言模型(VLMs)如CLIP在各种下游任务上展现了令人印象深刻的泛化性能,但它们仍然容易受到对抗攻击。虽然先前的研究主要集中在提高图像编码器的对抗鲁棒性以抵御针对图像的攻击,但对基于文本的攻击和多模态攻击的探索在很大程度上被忽视了。在这项工作中,我们首次系统而全面地研究了如何使视觉-语言模型在多模态攻击下具备对抗鲁棒性。首先,我们引入了一种多模态攻击策略,并研究了不同攻击的影响。然后,我们提出了一种多模态对比对抗训练损失,将干净和对抗的文本嵌入分别与对抗和干净的视觉特征对齐,以同时增强CLIP图像编码器和文本编码器的对抗鲁棒性。在两个任务共15个数据集上的大量实验证明,我们的方法显著提高了CLIP的对抗鲁棒性。有趣的是,我们发现针对多模态对抗攻击微调的模型,即使在仅受到图像攻击的情况下,也比仅针对图像攻击微调的模型表现出更强的鲁棒性,这可能为增强VLMs的安全性开辟新的可能性。
https://arxiv.org/abs/2404.19287
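The cross-pairing idea in the loss above — clean text aligned with adversarial images, and adversarial text with clean images — can be sketched as two InfoNCE-style contrastive terms. The exact loss, weighting, and temperature are not given in the abstract, so everything below is an assumed illustration on random features:

```python
import numpy as np

def contrastive_loss(img, txt, temp=0.07):
    # InfoNCE over a batch: matching (image_i, text_i) pairs are positives,
    # all other pairs in the batch are negatives.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temp
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(img))
    return -log_prob[idx, idx].mean()

def multimodal_adv_loss(img_clean, img_adv, txt_clean, txt_adv):
    # Cross-pairing from the abstract: adversarial images <-> clean text,
    # clean images <-> adversarial text. Equal weighting is an assumption.
    return 0.5 * (contrastive_loss(img_adv, txt_clean)
                  + contrastive_loss(img_clean, txt_adv))

rng = np.random.default_rng(2)
B, D = 4, 16
ic = rng.normal(size=(B, D))                 # clean image features
tc = ic + 0.05 * rng.normal(size=(B, D))     # clean text, roughly aligned
ia = ic + 0.3 * rng.normal(size=(B, D))      # adversarial image features
ta = tc + 0.3 * rng.normal(size=(B, D))      # adversarial text features
loss = float(multimodal_adv_loss(ic, ia, tc, ta))
print(round(loss, 4))
```

Minimizing such a loss pulls perturbed features of one modality back toward clean features of the other, which is one plausible reading of why both encoders gain robustness.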
Knowledge graphs (KGs) are large datasets with specific structures representing large knowledge bases (KB), where each node represents a key entity and the relations amongst them are typed edges. Natural language queries formed to extract information from a KB entail starting from specific nodes and reasoning over multiple edges of the corresponding KG to arrive at the correct set of answer nodes. Traditional approaches to question answering on KGs are based on (a) semantic parsing (SP), where a logical form (e.g., S-expression, SPARQL query, etc.) is generated using node and edge embeddings and then reasoned over, or language models are tuned to generate the final answer directly, or (b) information retrieval (IR), which works by extracting entities and relations sequentially. In this work, we evaluate the capability of large language models (LLMs) to answer questions over KGs that involve multiple hops. We show that depending upon the size and nature of the KG we need different approaches to extract and feed the relevant information to an LLM, since every LLM comes with a fixed context window. We evaluate our approach on six KGs, with and without the availability of example-specific sub-graphs, and show that both the IR- and SP-based methods can be adopted by LLMs, resulting in extremely competitive performance.
知识图谱(KGs)是具有特定结构的大型数据集,表示大规模知识库(KB),其中每个节点代表一个关键实体,实体之间的关系为类型化的边。为从KB中提取信息而构造的自然语言查询,需要从特定节点出发,在相应KG的多条边上进行推理,才能到达正确的答案节点集合。传统的KG问答方法基于:(a)语义解析(SP),即利用节点和边的嵌入生成逻辑形式(例如S-表达式、SPARQL查询等)并在这些表示上推理,或微调语言模型直接生成最终答案;或(b)信息检索(IR),通过顺序提取实体和关系来工作。在这项工作中,我们评估了大语言模型(LLMs)回答涉及多跳推理的KG问题的能力。我们表明,由于每个LLM都有固定的上下文窗口,因此根据KG的规模和性质,我们需要不同的方法来提取相关信息并将其提供给LLM。我们在六个KG上(分别在有和没有样例特定子图的情况下)评估了我们的方法,并表明基于IR和SP的方法都可以被LLM采用,从而获得极具竞争力的性能。
https://arxiv.org/abs/2404.19234
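The fixed-context-window constraint mentioned above implies that retrieved KG triples must be serialized into the prompt under a token budget. A minimal sketch of that step (the prompt format, the whitespace tokenizer stand-in, and the assumption that triples arrive pre-ranked by relevance are all hypothetical):

```python
def triples_to_prompt(question, triples, max_tokens,
                      tokens=lambda s: len(s.split())):
    # Serialize as many KG triples as fit the LLM's context budget.
    # `tokens` is a crude whitespace stand-in for a real tokenizer.
    header = f"Answer using these facts.\nQuestion: {question}\nFacts:\n"
    budget = max_tokens - tokens(header)
    lines = []
    for h, r, t in triples:          # assume triples are ranked by relevance
        line = f"({h}, {r}, {t})"
        cost = tokens(line)
        if cost > budget:
            break                    # stop before overflowing the window
        budget -= cost
        lines.append(line)
    return header + "\n".join(lines)

kg = [("Paris", "capital_of", "France"),
      ("France", "part_of", "Europe"),
      ("Paris", "population", "2.1M")]
prompt = triples_to_prompt("What continent is Paris in?", kg, max_tokens=20)
print(prompt)
```

With a small KG the whole example-specific sub-graph fits in the prompt (an IR-style setup); for larger KGs, the same budget forces either aggressive retrieval or an SP-style logical form instead, which is the trade-off the abstract points at.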