Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models showing their astounding ability to synthesize convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques are susceptible to unintended modifications of regions outside the targeted area, such as on the background or on distractor objects that have a semantic or visual relationship with the targeted object. According to our experimental findings, inaccurate cross-attention maps are at the root of this problem. Based on this observation, we propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt. By updating the dynamic tokens for nouns in the textual input with the proposed leakage repairment losses, we achieve fine-grained image editing over particular objects while preventing undesired changes to other image regions. Our method DPL, based on the publicly available Stable Diffusion, is extensively evaluated on a wide range of images, and consistently obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (in user evaluations). We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
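To make the cross-attention idea concrete, here is a minimal sketch of the kind of attention-focusing objective such a method could optimize. It assumes access to per-noun cross-attention maps from the diffusion UNet; the two terms below are illustrative stand-ins, not the paper's exact leakage repairment losses.

```python
# Illustrative sketch (not the exact DPL losses): given cross-attention maps for
# the noun tokens of the prompt, penalize (i) diffuse maps and (ii) overlap
# between maps of different nouns, so each learnable token stays on its object.
import torch

def attention_focus_losses(attn_maps: torch.Tensor) -> torch.Tensor:
    """attn_maps: [num_nouns, H, W] non-negative cross-attention maps."""
    # Normalize each map into a distribution over pixels.
    probs = attn_maps.flatten(1)
    probs = probs / probs.sum(dim=1, keepdim=True).clamp_min(1e-8)

    # (i) Entropy term: a focused map has low entropy.
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

    # (ii) Overlap term: maps of different nouns should not attend to the same pixels.
    overlap = probs @ probs.t()                       # [N, N] pairwise dot products
    off_diag = overlap - torch.diag(torch.diag(overlap))
    leakage = off_diag.sum() / max(probs.shape[0] * (probs.shape[0] - 1), 1)

    return entropy + leakage

# Such a loss would be minimized w.r.t. the dynamic noun-token embeddings only,
# keeping the diffusion model frozen.
attn = torch.rand(3, 16, 16)          # e.g. maps for three noun tokens
print(attention_focus_losses(attn))
```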
https://arxiv.org/abs/2309.15664
Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step in order to bridge visual and language models. However, current state-of-the-art computer vision models still lack the ability to perform spatial reasoning well. Existing datasets mostly cover a relatively small number of spatial relations, all of which are static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD) -- a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images), consisting of 30 distinct spatial prepositional senses, in the form of object interaction simulations generated synthetically using Unity3D. In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions. To our knowledge, no dataset exists that represents temporal relations through visual settings. In this dataset, we also provide 3D information about object interactions such as frame-wise coordinates, and descriptions of the objects used. The goal of this synthetic dataset is to help models perform better in visual relationship detection in real-world settings. We demonstrate an increase in the performance of various models over 2 real-world datasets (ImageNet-VidVRD and Spatial Senses) when pretrained on the STUPD dataset, in comparison to other pretraining datasets.
https://arxiv.org/abs/2309.06680
Food image classification serves as a fundamental and critical step in image-based dietary assessment, facilitating nutrient intake analysis from captured food images. However, existing work in food classification predominantly focuses on predicting 'food types', which do not contain direct nutritional composition information. This limitation arises from the inherent discrepancies in nutrition databases, which are tasked with associating each 'food item' with its respective information. Therefore, in this work we aim to classify food items so that they align with a nutrition database. To this end, we first introduce the VFN-nutrient dataset by annotating each food image in VFN with a food item that includes nutritional composition information. Such annotation of food items, being more discriminative than food types, creates a hierarchical structure within the dataset. However, since the food item annotations are solely based on nutritional composition information, they do not always show visual relations with each other, which poses significant challenges when applying deep learning-based techniques for classification. To address this issue, we then propose a multi-stage hierarchical framework for food item classification by iteratively clustering and merging food items during the training process, which allows the deep model to extract image features that are discriminative across labels. Our method is evaluated on the VFN-nutrient dataset and achieves promising results compared with existing work in terms of both food type and food item classification.
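A rough sketch of the cluster-and-merge step described above, under the assumption that visually indistinguishable food-item labels are grouped using class-mean features from the current backbone; all sizes and names here are illustrative.

```python
# Sketch of the iterative "cluster and merge" idea: food-item labels that are
# visually similar are merged into coarser training labels, the backbone is
# retrained, and the procedure repeats with refreshed class-mean features.
import numpy as np
from sklearn.cluster import KMeans

def merge_visually_similar_items(item_features: np.ndarray, n_clusters: int) -> np.ndarray:
    """item_features: [num_items, dim] mean image feature per food-item label.
    Returns a coarse cluster id for every food-item label."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return kmeans.fit_predict(item_features)

# Hypothetical usage: 200 food-item labels, 512-d class-mean features.
class_means = np.random.randn(200, 512).astype(np.float32)
coarse_labels = merge_visually_similar_items(class_means, n_clusters=50)
print(coarse_labels.shape)  # (200,) -> mapping food item -> merged training label
```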
https://arxiv.org/abs/2309.01075
Scene Graph Generation (SGG) aims to detect all the visual relation triplets <sub, pred, obj> in a given image. With the emergence of various advanced techniques for better utilizing both the intrinsic and extrinsic information in each relation triplet, SGG has achieved great progress over the recent years. However, due to the ubiquitous long-tailed predicate distributions, today's SGG models are still easily biased to the head predicates. Currently, the most prevalent debiasing solutions for SGG are re-balancing methods, e.g., changing the distributions of original training samples. In this paper, we argue that all existing re-balancing strategies fail to increase the diversity of the relation triplet features of each predicate, which is critical for robust SGG. To this end, we propose a novel Compositional Feature Augmentation (CFA) strategy, which is the first unbiased SGG work to mitigate the bias issue from the perspective of increasing the diversity of triplet features. Specifically, we first decompose each relation triplet feature into two components: intrinsic feature and extrinsic feature, which correspond to the intrinsic characteristics and extrinsic contexts of a relation triplet, respectively. Then, we design two different feature augmentation modules to enrich the feature diversity of original relation triplets by replacing or mixing up either their intrinsic or extrinsic features from other samples. Due to its model-agnostic nature, CFA can be seamlessly incorporated into various SGG frameworks. Extensive ablations have shown that CFA achieves a new state-of-the-art performance on the trade-off between different metrics.
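A minimal sketch of the replace/mix-up augmentation the abstract describes, assuming the triplet feature has already been split into intrinsic and extrinsic parts; the mixing coefficient and concatenation are illustrative choices, not the paper's exact modules.

```python
# Compositional feature augmentation sketch: a triplet feature is treated as
# (intrinsic, extrinsic) parts; diversity is added by replacing or mixing one
# part with that of another sample sharing the same predicate label.
import torch

def compositional_augment(intrinsic: torch.Tensor,
                          extrinsic: torch.Tensor,
                          donor_intrinsic: torch.Tensor,
                          donor_extrinsic: torch.Tensor,
                          mode: str = "mixup",
                          lam: float = 0.7) -> torch.Tensor:
    """Each argument is a [dim] feature; donor_* comes from another triplet
    with the same predicate label."""
    if mode == "replace_intrinsic":
        intrinsic = donor_intrinsic
    elif mode == "replace_extrinsic":
        extrinsic = donor_extrinsic
    elif mode == "mixup":
        intrinsic = lam * intrinsic + (1 - lam) * donor_intrinsic
        extrinsic = lam * extrinsic + (1 - lam) * donor_extrinsic
    return torch.cat([intrinsic, extrinsic], dim=-1)   # recomposed triplet feature

feat = compositional_augment(torch.randn(256), torch.randn(256),
                             torch.randn(256), torch.randn(256))
print(feat.shape)  # torch.Size([512])
```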
https://arxiv.org/abs/2308.06712
Scene graph generation aims to detect visual relationship triplets, (subject, predicate, object). Due to biases in data, current models tend to predict common predicates, e.g. "on" and "at", instead of informative ones, e.g. "standing on" and "looking at". This tendency results in a loss of precise information and degrades overall performance. If a model only uses "stone on road" rather than "stone blocking road" to describe an image, it may be a grave misunderstanding. We argue that this phenomenon is caused by two imbalances: semantic space level imbalance and training sample level imbalance. For this problem, we propose DB-SGG, an effective framework based on debiasing rather than conventional distribution fitting. It integrates two components, Semantic Debiasing (SD) and Balanced Predicate Learning (BPL), to address these imbalances. SD utilizes a confusion matrix and a bipartite graph to construct predicate relationships. BPL adopts a random undersampling strategy and an ambiguity-removing strategy to focus on informative predicates. Benefiting from its model-agnostic design, our method can be easily applied to SGG models and outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 at three SGG sub-tasks on the SGG-VG dataset. Our method is further verified on another complex SGG dataset (SGG-GQA) and two downstream tasks (sentence-to-graph retrieval and image captioning).
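The random-undersampling side of Balanced Predicate Learning can be illustrated with a few lines; the cap value is a hypothetical parameter, not one reported by the paper.

```python
# Undersampling sketch: cap the number of training triplets per predicate so
# head predicates such as "on" no longer dominate the informative ones.
import random
from collections import defaultdict

def undersample_predicates(triplets, max_per_predicate=2000, seed=0):
    """triplets: list of (subject, predicate, object) annotations."""
    rng = random.Random(seed)
    by_pred = defaultdict(list)
    for t in triplets:
        by_pred[t[1]].append(t)
    balanced = []
    for pred, items in by_pred.items():
        rng.shuffle(items)
        balanced.extend(items[:max_per_predicate])
    rng.shuffle(balanced)
    return balanced

sample = [("stone", "on", "road")] * 10 + [("stone", "blocking", "road")] * 2
print(len(undersample_predicates(sample, max_per_predicate=3)))  # 5
```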
https://arxiv.org/abs/2308.05286
Vague objectives in many real-life scenarios pose long-standing challenges for robotics, as defining rules, rewards, or constraints for optimization is difficult. Tasks like tidying a messy table may appear simple for humans, but articulating the criteria for tidiness is complex due to the ambiguity and flexibility in commonsense reasoning. Recent advancement in Large Language Models (LLMs) offers us an opportunity to reason over these vague objectives: learned from extensive human data, LLMs capture meaningful common sense about human behavior. However, as LLMs are trained solely on language input, they may struggle with robotic tasks due to their limited capacity to account for perception and low-level controls. In this work, we propose a simple approach to solve the task of table tidying, an example of robotic tasks with vague objectives. Specifically, the task of tidying a table involves not just clustering objects by type and functionality for semantic tidiness but also considering spatial-visual relations of objects for a visually pleasing arrangement, termed as visual tidiness. We propose to learn a lightweight, image-based tidiness score function to ground the semantically tidy policy of LLMs to achieve visual tidiness. We innovatively train the tidiness score using synthetic data gathered using random walks from a few tidy configurations. Such trajectories naturally encode the order of tidiness, thereby eliminating the need for laborious and expensive human demonstrations. Our empirical results show that our pipeline can be applied to unseen objects and complex 3D arrangements.
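A possible way the random-walk supervision could be turned into a training signal is a ranking loss over trajectory order, as sketched below; the network, image size, and margin are placeholders, not the paper's implementation.

```python
# Sketch: starting from a tidy layout, each random perturbation step is assumed
# less tidy than the previous one, giving free ranking supervision for an
# image-based tidiness score without human demonstrations.
import torch
import torch.nn as nn

score_net = nn.Sequential(              # stand-in for the image-based score function
    nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 1))
ranking_loss = nn.MarginRankingLoss(margin=0.1)
optimizer = torch.optim.Adam(score_net.parameters(), lr=1e-4)

def train_on_trajectory(frames: torch.Tensor):
    """frames: [T, 3, 64, 64], frame 0 is the tidy start of the random walk."""
    scores = score_net(frames).squeeze(-1)            # [T]
    earlier, later = scores[:-1], scores[1:]
    target = torch.ones_like(earlier)                 # earlier frame should score higher
    loss = ranking_loss(earlier, later, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_on_trajectory(torch.rand(8, 3, 64, 64)))
```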
https://arxiv.org/abs/2307.11319
Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
https://arxiv.org/abs/2307.08984
In this paper, we explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships, which involves interpreting visual features from images into language-based relations. Current state-of-the-art methods use complex graphical models that utilize language cues and visual features to address this challenge. We hypothesize that the strong language priors in CLIP embeddings can simplify these graphical models, paving the way for a simpler approach. We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding with subject, object, and union box embeddings from a scene. We systematically explore the design of CLIP-based subject, object, and union-box representations within the UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate Estimation). CREPE utilizes text-based representations for all three bounding boxes and introduces a novel contrastive training strategy to automatically infer the text prompt for the union box. Our approach achieves state-of-the-art performance in predicate estimation (mR@5 of 27.79 and mR@20 of 31.95) on the Visual Genome benchmark, a 15.3% gain over the recent state of the art at mR@20. This work demonstrates CLIP's effectiveness in object relation prediction and encourages further research on VLMs in this challenging domain.
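The UVTransE backbone referenced here has a compact translational form, roughly predicate ≈ union − subject − object; a minimal sketch of that head follows, with projection sizes chosen for illustration only.

```python
# UVTransE's core idea in a few lines: the predicate is modeled as a
# translational embedding recovered from the union box after subtracting the
# subject and object contributions.
import torch
import torch.nn as nn

class UVTransEHead(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=256, num_predicates=50):
        super().__init__()
        self.proj_s = nn.Linear(feat_dim, embed_dim)
        self.proj_o = nn.Linear(feat_dim, embed_dim)
        self.proj_u = nn.Linear(feat_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_predicates)

    def forward(self, subj_feat, obj_feat, union_feat):
        # predicate embedding  p ≈ u - s - o
        pred_embed = self.proj_u(union_feat) - self.proj_s(subj_feat) - self.proj_o(obj_feat)
        return self.classifier(pred_embed)

# Per the abstract, CREPE keeps this structure but fills the subject, object,
# and union slots with CLIP text-based representations rather than purely
# visual ones.
head = UVTransEHead()
logits = head(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 50])
```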
https://arxiv.org/abs/2307.04838
Research in scene graph generation (SGG) usually considers two-stage models, that is, detecting a set of entities, followed by combining them and labeling all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead, and typically hinders end-to-end optimization. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR, a set-based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely TraCQ, with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well, which leads to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods, it also beats many state-of-the-art two-stage methods on the Visual Genome dataset, while remaining capable of end-to-end training and faster inference.
https://arxiv.org/abs/2306.05689
Learning to compose visual relationships from raw images in the form of scene graphs is a highly challenging task due to contextual dependencies, but it is essential in computer vision applications that depend on scene understanding. However, no current approaches in Scene Graph Generation (SGG) aim at providing useful graphs for downstream tasks. Instead, the main focus has primarily been on the task of unbiasing the data distribution for predicting more fine-grained relations. That said, not all fine-grained relations are equally relevant, and at least some of them are of no use for real-world applications. In this work, we introduce the task of Efficient SGG, which prioritizes the generation of relevant relations, facilitating the use of scene graphs in downstream tasks such as image generation. To support further approaches to this task, we present a new dataset, VG150-curated, based on the annotations of the popular Visual Genome dataset. We show through a set of experiments that this dataset contains more high-quality and diverse annotations than the one usually adopted by approaches in SGG. Finally, we show the efficiency of this dataset in the task of image generation from scene graphs. Our approach can be easily replicated to improve the quality of other scene graph generation datasets.
https://arxiv.org/abs/2305.18668
Memes are a popular form of communicating trends and ideas in social media and on the internet in general, combining the modalities of images and text. They can express humor and sarcasm but can also have offensive content. Analyzing and classifying memes automatically is challenging since their interpretation relies on the understanding of visual elements, language, and background knowledge. Thus, it is important to meaningfully represent these sources and the interaction between them in order to classify a meme as a whole. In this work, we propose to use scene graphs, that express images in terms of objects and their visual relations, and knowledge graphs as structured representations for meme classification with a Transformer-based architecture. We compare our approach with ImgBERT, a multimodal model that uses only learned (instead of structured) representations of the meme, and observe consistent improvements. We further provide a dataset with human graph annotations that we compare to automatically generated graphs and entity linking. Analysis shows that automatic methods link more entities than human annotators and that automatically generated graphs are better suited for hatefulness classification in memes.
https://arxiv.org/abs/2305.18391
Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how to add supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions and treat them as additional views of images. With masked relation prediction, we further encourage relating entities from visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly-supervised relations data.
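The verbalisation step is simple to picture; below is a toy version with an illustrative template (the actual caption format used in the paper may differ).

```python
# Minimal verbalisation of relation triplets into a structured caption, in the
# spirit of the "verbalised scene graphs" objective.
def verbalise_scene_graph(triplets):
    """triplets: list of (subject, predicate, object) strings for one image."""
    clauses = [f"{s} {p} {o}" for s, p, o in triplets]
    return "there is " + " ; ".join(clauses) + " ."

caption = verbalise_scene_graph([("a man", "riding", "a horse"),
                                 ("a horse", "standing on", "grass")])
print(caption)
# -> "there is a man riding a horse ; a horse standing on grass ."
# The caption is paired with the image as an extra view during pretraining;
# masked relation prediction would instead mask the predicate token(s).
```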
https://arxiv.org/abs/2305.14281
Visual relation extraction (VRE) aims to extract relations between entities from visually rich documents. Existing methods usually predict relations for each entity pair independently based on entity features but ignore the global structure information, i.e., dependencies between entity pairs. The absence of global structure information may make the model struggle to learn long-range relations and easily predict conflicting results. To alleviate such limitations, we propose a GlObal Structure knowledge-guided relation Extraction (GOSE) framework, which captures dependencies between entity pairs in an iterative manner. Given a scanned image of the document, GOSE first generates preliminary relation predictions on entity pairs. Secondly, it mines global structure knowledge based on the prediction results of the previous iteration and further incorporates global structure knowledge into entity representations. This "generate-capture-incorporate" schema is performed multiple times so that entity representations and global structure knowledge can mutually reinforce each other. Extensive experiments show that GOSE not only outperforms previous methods in the standard fine-tuning setting but also shows promising superiority in cross-lingual learning; it even yields stronger data-efficient performance in the low-resource setting.
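The control flow of the generate-capture-incorporate schema can be written down compactly; the three callables below are placeholders standing in for the model's components, not the released GOSE API.

```python
# Pseudo-structure of the iterative "generate-capture-incorporate" schema.
def gose_iterate(entity_reprs, predict_relations, mine_global_structure,
                 incorporate, num_iterations=3):
    relations = predict_relations(entity_reprs)               # generate
    for _ in range(num_iterations):
        structure = mine_global_structure(relations)          # capture dependencies
        entity_reprs = incorporate(entity_reprs, structure)   # incorporate
        relations = predict_relations(entity_reprs)           # refine predictions
    return relations

# Toy stand-ins just to show the control flow.
relations = gose_iterate(
    entity_reprs=[0.0, 0.0],
    predict_relations=lambda reprs: [("e0", "linked_to", "e1")],
    mine_global_structure=lambda rels: {"num_relations": len(rels)},
    incorporate=lambda reprs, structure: reprs,
)
print(relations)
```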
https://arxiv.org/abs/2305.13850
NeSy4VRD is a multifaceted resource designed to support the development of neurosymbolic AI (NeSy) research. NeSy4VRD re-establishes public access to the images of the VRD dataset and couples them with an extensively revised, quality-improved version of the VRD visual relationship annotations. Crucially, NeSy4VRD provides a well-aligned, companion OWL ontology that describes the dataset. NeSy4VRD comes with open source infrastructure that provides comprehensive support for extensibility of the annotations (which, in turn, facilitates extensibility of the ontology), and open source code for loading the annotations to/from a knowledge graph. We are contributing NeSy4VRD to the computer vision, NeSy and Semantic Web communities to help foster more NeSy research using OWL-based knowledge graphs.
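For a sense of what loading such annotations into a knowledge graph looks like, here is an rdflib sketch; the namespace, class, and property names are placeholders, not the actual NeSy4VRD OWL ontology terms.

```python
# Illustrative loading of VRD-style visual relationship annotations into an RDF
# knowledge graph with rdflib.
from rdflib import Graph, Namespace, RDF, Literal

EX = Namespace("http://example.org/nesy4vrd/")   # placeholder namespace
g = Graph()
g.bind("ex", EX)

def add_relationship(graph, image_id, subj, pred, obj):
    rel = EX[f"{image_id}_{subj}_{pred}_{obj}"]
    graph.add((rel, RDF.type, EX.VisualRelationship))
    graph.add((rel, EX.hasSubject, EX[subj]))
    graph.add((rel, EX.hasPredicate, EX[pred]))
    graph.add((rel, EX.hasObject, EX[obj]))
    graph.add((rel, EX.inImage, Literal(image_id)))

add_relationship(g, "img_000001", "person", "ride", "horse")
print(g.serialize(format="turtle"))
```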
https://arxiv.org/abs/2305.13258
Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE.
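The score-fusion step of such a composite-prompt approach can be sketched generically; the encoders, cue components, and weights below are stand-ins (the actual RECODE prompts and LLM-proposed weights are not reproduced here).

```python
# Sketch of composite-description score fusion: each predicate gets description
# cues for its subject, object, and spatial components; per-cue CLIP
# similarities are fused with per-cue weights.
import torch
import torch.nn.functional as F

def fused_predicate_scores(image_feat, cue_text_feats, cue_weights):
    """image_feat: [dim] image feature of the subject-object region.
    cue_text_feats: component name -> [num_predicates, dim] text features of the
    LLM-generated descriptions for that component."""
    image_feat = F.normalize(image_feat, dim=-1)
    scores = 0.0
    for component, text_feats in cue_text_feats.items():
        text_feats = F.normalize(text_feats, dim=-1)
        scores = scores + cue_weights[component] * (text_feats @ image_feat)
    return scores                                   # [num_predicates]

scores = fused_predicate_scores(
    torch.randn(512),
    {"subject": torch.randn(50, 512), "object": torch.randn(50, 512),
     "spatial": torch.randn(50, 512)},
    {"subject": 0.3, "object": 0.3, "spatial": 0.4},   # e.g. weights an LLM proposes
)
print(scores.argmax())   # index of the highest-scoring predicate class
```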
https://arxiv.org/abs/2305.12476
Visual reasoning is a long-term goal of vision research. In the last decade, several works have attempted to apply deep neural networks (DNNs) to the task of learning visual relations from images, with modest results in terms of the generalization of the relations learned. In recent years, several innovations in DNNs have been developed in order to enable learning abstract relations from images. In this work, we systematically evaluate a series of DNNs that integrate mechanisms such as slot attention, recurrently guided attention, and external memory, on the simplest possible visual reasoning task: deciding whether two objects are the same or different. We found that, although some models performed better than others in generalizing the same-different relation to specific types of images, no model was able to generalize this relation across the board. We conclude that abstract visual reasoning remains largely an unresolved challenge for DNNs.
https://arxiv.org/abs/2304.07091
The transferability of adversarial examples is a crucial aspect of evaluating the robustness of deep learning systems, particularly in black-box scenarios. Although several methods have been proposed to enhance cross-model transferability, little attention has been paid to the transferability of adversarial examples across different tasks. This issue has become increasingly relevant with the emergence of foundational multi-task AI systems such as Visual ChatGPT, rendering the utility of adversarial samples generated by a single task relatively limited. Furthermore, these systems often entail inferential functions beyond mere recognition-like tasks. To address this gap, we propose a novel Visual Relation-based cross-task Adversarial Patch generation method called VRAP, which aims to evaluate the robustness of various visual tasks, especially those involving visual reasoning, such as Visual Question Answering and Image Captioning. VRAP employs scene graphs to combine object recognition-based deception with predicate-based relations elimination, thereby disrupting the visual reasoning information shared among inferential tasks. Our extensive experiments demonstrate that VRAP significantly surpasses previous methods in terms of black-box transferability across diverse visual reasoning tasks.
https://arxiv.org/abs/2304.05402
The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gain over existing methods highlighting its superiority in generating more unbiased scene graphs.
https://arxiv.org/abs/2304.00733
Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior" - real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior. 2) The relation prompt should be disentangled away from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
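One way to picture the preposition prior is as a contrastive steering loss on the learnable relation token, pulling it toward a basis set of preposition embeddings and away from other vocabulary items; the InfoNCE-style form below is an illustrative reading, not the paper's exact objective.

```python
# Sketch of a relation-steering contrastive loss built on the preposition prior.
import torch
import torch.nn.functional as F

def relation_steering_loss(relation_token, preposition_embeds, negative_embeds,
                           temperature: float = 0.07) -> torch.Tensor:
    """relation_token: [dim]; preposition_embeds: [P, dim]; negative_embeds: [N, dim]."""
    r = F.normalize(relation_token, dim=-1)
    pos = F.normalize(preposition_embeds, dim=-1) @ r / temperature   # [P]
    neg = F.normalize(negative_embeds, dim=-1) @ r / temperature      # [N]
    logits = torch.cat([pos, neg])
    # Multi-positive InfoNCE: probability mass should land on the preposition anchors.
    return -torch.logsumexp(pos, 0) + torch.logsumexp(logits, 0)

loss = relation_steering_loss(torch.randn(768, requires_grad=True),
                              torch.randn(20, 768),   # basis prepositions
                              torch.randn(200, 768))  # other vocabulary embeddings
print(loss)
```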
https://arxiv.org/abs/2303.13495
Current video-based scene graph generation (VidSGG) methods have been found to perform poorly on predicting predicates that are less represented due to the inherent biased distribution in the training data. In this paper, we take a closer look at the predicates and identify that most visual relations (e.g. sit_above) involve both actional pattern (sit) and spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e. VidVRD. Extensive experiments demonstrate that the DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving the state-of-the-art VidSGG performance.
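The decoupling can be pictured as two classification heads over a shared pair feature, with the predicate recovered from the (actional, spatial) pattern pair; sizes and the toy mapping below are illustrative, not the paper's configuration.

```python
# Sketch of decoupled label learning: one classifier predicts the actional
# pattern, another the spatial pattern, and a predicate such as "sit_above" is
# recovered from the pattern pair.
import torch
import torch.nn as nn

class DecoupledPredicateHead(nn.Module):
    def __init__(self, feat_dim=512, num_actional=20, num_spatial=10):
        super().__init__()
        self.actional = nn.Linear(feat_dim, num_actional)   # e.g. sit, walk, hold
        self.spatial = nn.Linear(feat_dim, num_spatial)     # e.g. above, beneath, next_to

    def forward(self, pair_feat):
        return self.actional(pair_feat), self.spatial(pair_feat)

# Toy mapping from (actional, spatial) pattern indices back to a predicate label.
pattern_to_predicate = {(3, 1): "sit_above", (7, 4): "walk_past"}
head = DecoupledPredicateHead()
act_logits, spa_logits = head(torch.randn(1, 512))
pattern = (act_logits.argmax(-1).item(), spa_logits.argmax(-1).item())
print(pattern_to_predicate.get(pattern, "unknown_pattern"))
```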
https://arxiv.org/abs/2303.13209