Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how to add supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional views of images. With masked relation prediction, we further encourage relating entities from visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly-supervised relation data.
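The first pretraining signal, verbalised scene graphs, turns relation triplets into caption-like text that serves as an extra view of the image. A minimal sketch of the idea follows; the exact template is an assumption for illustration, not the authors' implementation:

```python
# Illustrative sketch: render (subject, predicate, object) triplets as a
# structured caption. The ". "-joined template is an assumption, not the
# paper's exact verbalisation format.

def verbalise_scene_graph(triplets):
    """Render visual relation triplets as one caption-like string."""
    return ". ".join(f"{s} {p} {o}" for s, p, o in triplets) + "."
```

For example, `verbalise_scene_graph([("man", "riding", "horse"), ("horse", "on", "beach")])` yields `"man riding horse. horse on beach."`, which can be paired with the image like any other caption.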
Visual relation extraction (VRE) aims to extract relations between entities from visually rich documents. Existing methods usually predict relations for each entity pair independently based on entity features but ignore the global structure information, i.e., dependencies between entity pairs. The absence of global structure information may make the model struggle to learn long-range relations and easily predict conflicting results. To alleviate such limitations, we propose a GlObal Structure knowledge-guided relation Extraction (GOSE) framework, which captures dependencies between entity pairs in an iterative manner. Given a scanned image of the document, GOSE first generates preliminary relation predictions on entity pairs. Second, it mines global structure knowledge based on the prediction results of the previous iteration and further incorporates this knowledge into entity representations. This "generate-capture-incorporate" schema is performed multiple times so that entity representations and global structure knowledge can mutually reinforce each other. Extensive experiments show that GOSE not only outperforms previous methods in the standard fine-tuning setting but also shows promising superiority in cross-lingual learning, and even yields stronger data-efficient performance in the low-resource setting.
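GOSE's iterative "generate-capture-incorporate" schema can be sketched as a simple refinement loop. In the sketch below, `predict`, `mine_structure`, and `incorporate` are placeholders standing in for the paper's relation classifier, global-structure mining, and fusion modules, which the abstract does not specify:

```python
# Hedged sketch of the generate-capture-incorporate loop. The three callables
# are stand-ins for learned components, passed in so the loop itself is testable.

def gose_iterate(entity_reprs, predict, mine_structure, incorporate, n_iters=3):
    """Iteratively refine entity representations with mined global structure."""
    preds = predict(entity_reprs)                    # generate: preliminary predictions
    for _ in range(n_iters):
        structure = mine_structure(preds)            # capture: dependencies between pairs
        entity_reprs = incorporate(entity_reprs, structure)  # incorporate into entities
        preds = predict(entity_reprs)                # re-predict from enriched features
    return preds
```

The loop makes the mutual-reinforcement claim concrete: each iteration's predictions feed the structure mining, and the mined structure feeds back into the representations used for the next prediction.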
NeSy4VRD is a multifaceted resource designed to support the development of neurosymbolic AI (NeSy) research. NeSy4VRD re-establishes public access to the images of the VRD dataset and couples them with an extensively revised, quality-improved version of the VRD visual relationship annotations. Crucially, NeSy4VRD provides a well-aligned, companion OWL ontology that describes the dataset domain. NeSy4VRD comes with open source infrastructure that provides comprehensive support for extensibility of the annotations (which, in turn, facilitates extensibility of the ontology), and open source code for loading the annotations to/from a knowledge graph. We are contributing NeSy4VRD to the computer vision, NeSy and Semantic Web communities to help foster more NeSy research using OWL-based knowledge graphs.
Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE.
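RECODE's final step, fusing the per-cue similarity scores with LLM-generated weights, amounts to a weighted sum over cues for each candidate relation. The dictionary-based interface below is an illustrative assumption, not the authors' API:

```python
# Hedged sketch of cue fusion: each cue (subject, object, spatial) contributes
# a similarity score per relation class, combined with per-cue weights.

def fuse_cues(cue_scores, cue_weights):
    """cue_scores: {cue: {relation: similarity}}; cue_weights: {cue: weight}.
    Returns the best-scoring relation and the full fused score table."""
    relations = next(iter(cue_scores.values()))  # relation classes from any cue
    fused = {rel: sum(cue_weights[cue] * scores[rel]
                      for cue, scores in cue_scores.items())
             for rel in relations}
    return max(fused, key=fused.get), fused
```

In the paper, the weights themselves come from a chain-of-thought prompt to the LLM; here they are simply passed in.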
Visual reasoning is a long-term goal of vision research. In the last decade, several works have attempted to apply deep neural networks (DNNs) to the task of learning visual relations from images, with modest results in terms of the generalization of the relations learned. In recent years, several innovations in DNNs have been developed in order to enable learning abstract relations from images. In this work, we systematically evaluate a series of DNNs that integrate mechanisms such as slot attention, recurrently guided attention, and external memory, in the simplest possible visual reasoning task: deciding whether two objects are the same or different. We found that, although some models performed better than others in generalizing the same-different relation to specific types of images, no model was able to generalize this relation across the board. We conclude that abstract visual reasoning remains largely an unresolved challenge for DNNs.
The transferability of adversarial examples is a crucial aspect of evaluating the robustness of deep learning systems, particularly in black-box scenarios. Although several methods have been proposed to enhance cross-model transferability, little attention has been paid to the transferability of adversarial examples across different tasks. This issue has become increasingly relevant with the emergence of foundational multi-task AI systems such as Visual ChatGPT, rendering the utility of adversarial samples generated by a single task relatively limited. Furthermore, these systems often entail inferential functions beyond mere recognition-like tasks. To address this gap, we propose a novel Visual Relation-based cross-task Adversarial Patch generation method called VRAP, which aims to evaluate the robustness of various visual tasks, especially those involving visual reasoning, such as Visual Question Answering and Image Captioning. VRAP employs scene graphs to combine object recognition-based deception with predicate-based relation elimination, thereby disrupting the visual reasoning information shared among inferential tasks. Our extensive experiments demonstrate that VRAP significantly surpasses previous methods in terms of black-box transferability across diverse visual reasoning tasks.
The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gain over existing methods highlighting its superiority in generating more unbiased scene graphs.
Diffusion models have gained increasing popularity for their generative capabilities. Recently, there has been a surging need to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior" - real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior. 2) The relation prompt should be disentangled away from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
Current video-based scene graph generation (VidSGG) methods have been found to perform poorly on predicting predicates that are less represented due to the inherent biased distribution in the training data. In this paper, we take a closer look at the predicates and identify that most visual relations (e.g. sit_above) involve both actional pattern (sit) and spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e. VidVRD. Extensive experiments demonstrate that the DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving the state-of-the-art VidSGG performance.
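DLL's pattern-level decoupling and recomposition can be sketched as follows. The `_`-separated label format (e.g. `sit_above`) follows the example in the abstract; the probability tables and helper names are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of decoupled label learning: split predicates into actional
# and spatial patterns, score them with separate (here: given) classifier
# outputs, and map the best combination back to a predicate label.

def decouple(predicate):
    """Split a predicate label like 'sit_above' into (actional, spatial) patterns."""
    action, _, spatial = predicate.partition("_")
    return action, spatial or None  # purely actional predicates have no spatial part

def compose(action_probs, spatial_probs, predicates):
    """Score each candidate predicate from the two pattern classifiers."""
    def score(pred):
        action, spatial = decouple(pred)
        s = action_probs.get(action, 0.0)
        if spatial is not None:
            s *= spatial_probs.get(spatial, 0.0)
        return s
    return max(predicates, key=score)
```

The point of the decoupling is visible in the sketch: the classifiers only ever see pattern labels, whose distribution is far less skewed than that of the composed predicates.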
Deep neural networks have achieved promising results in automatic image captioning due to their effective representation learning and context-based content generation capabilities. As a prominent type of deep features used in many of the recent image captioning methods, the well-known bottom-up features provide a detailed representation of different objects of the image in comparison with the feature maps directly extracted from the raw image. However, the lack of high-level semantic information about the relationships between these objects is an important drawback of bottom-up features, despite their expensive and resource-demanding extraction procedure. To take advantage of visual relationships in caption generation, this paper proposes a deep neural network architecture for image captioning based on fusing the visual relationships information extracted from an image's scene graph with the spatial feature maps of the image. A multi-modal reward function is then introduced for deep reinforcement learning of the proposed network using a combination of language and vision similarities in a common embedding space. The results of extensive experimentation on the MSCOCO dataset show the effectiveness of using visual relationships in the proposed captioning method. Moreover, the results clearly indicate that the proposed multi-modal reward in deep reinforcement learning leads to better model optimization, outperforming several state-of-the-art image captioning algorithms, while using light and easy-to-extract image features. A detailed experimental study of the components constituting the proposed method is also presented.
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 60% relatively. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model.
Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding. Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance. Some recent papers tackle this problem by few-shot learning with elaborately designed pipelines and pre-trained word vectors. However, the performance of existing few-shot VRD models is severely hampered by the poor generalization capability, as they struggle to handle the vast semantic diversity of visual relationships. Nonetheless, humans have the ability to learn new relationships with just few examples based on their knowledge. Inspired by this, we devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge to improve the generalization ability of few-shot VRD. The textual knowledge and visual relation knowledge are acquired from a pre-trained language model and an automatically constructed visual relation knowledge graph, respectively. We extensively validate the effectiveness of our framework. Experiments conducted on three benchmarks from the commonly used Visual Genome dataset show that our performance surpasses existing state-of-the-art models with a large improvement.
Prompt tuning with large-scale pretrained vision-language models empowers open-vocabulary predictions trained on limited base categories, e.g., object classification and detection. In this paper, we propose compositional prompt tuning with motion cues: an extended prompt tuning paradigm for compositional predictions of video data. In particular, we present Relation Prompt (RePro) for Open-vocabulary Video Visual Relation Detection (Open-VidVRD), where conventional prompt tuning is easily biased to certain subject-object combinations and motion patterns. To this end, RePro addresses the two technical challenges of Open-VidVRD: 1) the prompt tokens should respect the two different semantic roles of subject and object, and 2) the tuning should account for the diverse spatio-temporal motion patterns of the subject-object compositions. Without bells and whistles, our RePro achieves a new state-of-the-art performance on two VidVRD benchmarks of not only the base training object and predicate categories, but also the unseen ones. Extensive ablations also demonstrate the effectiveness of the proposed compositional and multi-mode design of prompts. Code is available at this https URL.
Recent scene graph generation (SGG) frameworks have focused on learning complex relationships among multiple objects in an image. Thanks to the nature of the message passing neural network (MPNN) that models high-order interactions between objects and their neighboring objects, they are dominant representation learning modules for SGG. However, existing MPNN-based frameworks assume the scene graph as a homogeneous graph, which restricts the context-awareness of visual relations between objects. That is, they overlook the fact that the relations tend to be highly dependent on the objects with which the relations are associated. In this paper, we propose an unbiased heterogeneous scene graph generation (HetSGG) framework that captures relation-aware context using message passing neural networks. We devise a novel message passing layer, called relation-aware message passing neural network (RMP), that aggregates the contextual information of an image considering the predicate type between objects. Our extensive evaluations demonstrate that HetSGG outperforms state-of-the-art methods, especially on tail predicate classes.
This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints given in a scene graph. A typical natural scene contains several objects, often exhibiting visual relationships of varied complexities between them. These inter-object relationships provide strong contextual cues toward improving grounding performance compared to a traditional object query-only-based localization task. A scene graph is an efficient and structured way to represent all the objects and their semantic relationships in the image. In an attempt towards bridging these two modalities representing scenes and utilizing contextual information for improving object localization, we rigorously study the problem of grounding scene graphs on natural images. To this end, we propose a novel graph neural network-based approach referred to as Visio-Lingual Message PAssing Graph Neural Network (VL-MPAG Net). In VL-MPAG Net, we first construct a directed graph with object proposals as nodes and an edge between a pair of nodes representing a plausible relation between them. Then a three-step inter-graph and intra-graph message passing is performed to learn the context-dependent representation of the proposals and query objects. These object representations are used to score the proposals to generate object localization. The proposed method significantly outperforms the baselines on four public datasets.
Scene graphs provide structured semantic understanding beyond images. For downstream tasks, such as image retrieval, visual question answering, visual relationship detection, and even autonomous vehicle technology, scene graphs can not only distil complex image information but also correct the bias of visual models using semantic-level relations, which has broad application prospects. However, the heavy labour cost of constructing graph annotations may hinder the application of PSG in practical scenarios. Inspired by the observation that people usually identify the subject and object first and then determine the relationship between them, we propose to decouple the scene graph generation task into two sub-tasks: 1) an image segmentation task to pick out the qualified objects, and 2) a restricted auto-regressive text generation task to generate the relation between given objects. Therefore, in this work, we introduce image semantic relation generation (ISRG), a simple but effective image-to-text model, which achieved 31 points on the OpenPSG dataset and outperforms strong baselines by 16 points (ResNet-50) and 5 points (CLIP), respectively.
For humans, understanding the relationships between objects using visual signals is intuitive. For artificial intelligence, however, this task remains challenging. Researchers have made significant progress studying semantic relationship detection, such as human-object interaction detection and visual relationship detection. We take the study of visual relationships a step further from semantic to geometric. Specifically, we predict relative occlusion and relative distance relationships. However, detecting these relationships from a single image is challenging. Enforcing focused attention to task-specific regions plays a critical role in successfully detecting these relationships. In this work, (1) we propose a novel three-decoder architecture as the infrastructure for focused attention; (2) we use the generalized intersection box prediction task to effectively guide our model to focus on occlusion-specific regions; (3) our model achieves a new state-of-the-art performance on distance-aware relationship detection. Specifically, our model increases the distance F1-score from 33.8% to 38.6% and boosts the occlusion F1-score from 34.4% to 41.2%. Our code is publicly available.
Scene graph generation (SGG) is a fundamental task aimed at detecting visual relations between objects in an image. The prevailing SGG methods require all object classes to be given in the training set. Such a closed setting limits the practical application of SGG. In this paper, we introduce open-vocabulary scene graph generation, a novel, realistic and challenging setting in which a model is trained on a set of base object classes but is required to infer relations for unseen target object classes. To this end, we propose a two-step method that firstly pre-trains on large amounts of coarse-grained region-caption data and then leverages two prompt-based techniques to finetune the pre-trained model without updating its parameters. Moreover, our method can support inference over completely unseen object classes, which existing methods are incapable of handling. In extensive experiments on three benchmark datasets, Visual Genome, GQA, and Open-Image, our method significantly outperforms recent, strong SGG methods on the setting of Ov-SGG, as well as on the conventional closed SGG.
Visual relationship detection aims to detect the interactions between objects in an image; however, this task suffers from combinatorial explosion due to the variety of objects and interactions. Since the interactions associated with the same object are dependent, we explore the dependency of interactions to reduce the search space. We explicitly model objects and interactions by an interaction graph and then propose a message-passing-style algorithm to propagate the contextual information. We thus call the proposed method neural message passing (NMP). We further integrate language priors and spatial cues to rule out unrealistic interactions and capture spatial interactions. Experimental results on two benchmark datasets demonstrate the superiority of our proposed method. Our code is available at this https URL.
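NMP's core idea, propagating contextual information between dependent interactions, can be sketched as iterative neighbour averaging over the interaction graph. The plain averaging update below is an illustrative assumption standing in for the paper's learned message functions:

```python
# Hedged toy sketch of message passing over an interaction graph: nodes are
# candidate interactions, edges connect interactions that share an object, and
# each round smooths a node's score towards its neighbours' scores.

def propagate(scores, edges, alpha=0.5, n_rounds=2):
    """scores: per-interaction confidences; edges: (i, j) index pairs."""
    n = len(scores)
    neigh = {i: [] for i in range(n)}
    for i, j in edges:
        neigh[i].append(j)
        neigh[j].append(i)
    for _ in range(n_rounds):
        new = []
        for i in range(n):
            if neigh[i]:
                msg = sum(scores[j] for j in neigh[i]) / len(neigh[i])
                new.append((1 - alpha) * scores[i] + alpha * msg)
            else:
                new.append(scores[i])  # isolated interactions keep their score
        scores = new
    return scores
```

After a few rounds, interactions sharing an object pull each other's scores together, which is the dependency the method exploits to shrink the search space.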
The Scene Graph Generation (SGG) task aims to detect all the objects and their pairwise visual relationships in a given image. Although SGG has achieved remarkable progress over the last few years, almost all existing SGG models follow the same training paradigm: they treat both object and predicate classification in SGG as a single-label classification problem, and the ground-truths are one-hot target labels. However, this prevalent training paradigm has overlooked two characteristics of current SGG datasets: 1) For positive samples, some specific subject-object instances may have multiple reasonable predicates. 2) For negative samples, there are numerous missing annotations. Overlooking these two characteristics, SGG models are easily confused and make wrong predictions. To this end, we propose a novel model-agnostic Label Semantic Knowledge Distillation (LS-KD) for unbiased SGG. Specifically, LS-KD dynamically generates a soft label for each subject-object instance by fusing a predicted Label Semantic Distribution (LSD) with its original one-hot target label. LSD reflects the correlations between this instance and multiple predicate categories. Meanwhile, we propose two different strategies to predict LSD: iterative self-KD and synchronous self-KD. Extensive ablations and results on three SGG tasks have attested to the superiority and generality of our proposed LS-KD, which can consistently achieve decent trade-off performance between different predicate categories.
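The soft-label construction in LS-KD can be sketched as a convex combination of the one-hot target and the predicted Label Semantic Distribution (LSD); the fixed mixing weight `alpha` below is an illustrative assumption, since the abstract does not specify how the fusion is weighted:

```python
# Hedged sketch of LS-KD's soft-label fusion: mix the one-hot predicate target
# with a predicted LSD so plausible-but-unlabelled predicates keep some mass.

def soft_label(one_hot, lsd, alpha=0.5):
    """Fuse a one-hot predicate target with a predicted LSD into a soft target."""
    assert abs(sum(one_hot) - 1.0) < 1e-6 and abs(sum(lsd) - 1.0) < 1e-6
    return [(1 - alpha) * o + alpha * l for o, l in zip(one_hot, lsd)]
```

Because both inputs are distributions and the combination is convex, the fused target is itself a valid distribution, so it can replace the one-hot label in a standard cross-entropy loss without other changes.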