Interacting with real-world cluttered scenes poses several challenges to robotic agents that need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically handle simplified scenarios and focus on predicting pairwise object relationships following an initial object detection phase, but often overlook the global context or struggle with handling redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at this https URL.
https://arxiv.org/abs/2409.02035
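To make the dependency-graph idea concrete, here is a minimal PyTorch sketch of a pairwise relation head that maps object query embeddings to an adjacency matrix of dependency logits. It is an illustrative stand-in, not the D3G architecture: the embedding size, the role-specific projections, and the pair MLP are all assumptions.

```python
import torch
import torch.nn as nn

class DependencyHead(nn.Module):
    """Toy head: turn N object query embeddings into an N x N matrix of directed
    dependency logits (illustrative only, not the authors' implementation)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.subj_proj = nn.Linear(dim, dim)   # role-specific projections
        self.obj_proj = nn.Linear(dim, dim)
        self.pair_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (B, N, dim) decoder outputs, one per detected object
        B, N, D = queries.shape
        s = self.subj_proj(queries).unsqueeze(2).expand(B, N, N, D)  # row = subject
        o = self.obj_proj(queries).unsqueeze(1).expand(B, N, N, D)   # column = object
        logits = self.pair_mlp(torch.cat([s, o], dim=-1)).squeeze(-1)  # (B, N, N)
        return logits  # would be trained with BCE against the ground-truth dependency graph

adj_logits = DependencyHead()(torch.randn(2, 35, 256))
print(adj_logits.shape)  # torch.Size([2, 35, 35])
```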
To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96\% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED runs inference quickly even when a single engineering drawing contains numerous objects.
https://arxiv.org/abs/2409.00909
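The three-part layout named in the abstract (vision encoder, object encoder, relation decoder) can be sketched roughly as follows; the backbone, feature sizes, and the way boxes are fused with global context are placeholders, not ViRED's actual configuration.

```python
import torch
import torch.nn as nn

class TinyViRED(nn.Module):
    """Illustrative three-stage layout: encode the whole drawing, encode each object
    (table/circuit) box with global context, then decode pairwise relation logits."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.vision_encoder = nn.Sequential(          # stand-in for a real image backbone
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.object_encoder = nn.Linear(4 + dim, dim)  # box coordinates + global context
        self.relation_decoder = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, image, boxes):
        # image: (B, 3, H, W); boxes: (B, N, 4) normalized xyxy per object
        ctx = self.vision_encoder(image)                        # (B, dim)
        ctx = ctx.unsqueeze(1).expand(-1, boxes.size(1), -1)    # broadcast to all objects
        obj = self.object_encoder(torch.cat([boxes, ctx], -1))  # (B, N, dim)
        s = obj.unsqueeze(2).expand(-1, -1, obj.size(1), -1)
        o = obj.unsqueeze(1).expand(-1, obj.size(1), -1, -1)
        return self.relation_decoder(torch.cat([s, o], -1)).squeeze(-1)  # (B, N, N) logits

logits = TinyViRED()(torch.randn(1, 3, 256, 256), torch.rand(1, 6, 4))
print(logits.shape)  # torch.Size([1, 6, 6])
```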
Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64\% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96\% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80\%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
https://arxiv.org/abs/2408.14441
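A compact attention-based audio-visual fusion in the spirit described above might look like the hedged sketch below; the feature dimensions, the bidirectional cross-attention, and the pooling are assumptions, not the Attend-Fusion architecture itself.

```python
import torch
import torch.nn as nn

class AttendFusionSketch(nn.Module):
    """Hedged sketch of compact audio-visual fusion: each modality attends to the
    other before pooling and a shared multi-label classifier (sizes are placeholders)."""
    def __init__(self, dim: int = 256, num_classes: int = 1000, heads: int = 4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)  # label count is a placeholder

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, Tv, dim) and audio_feats: (B, Ta, dim) frame-level features
        v_ctx, _ = self.v2a(video_feats, audio_feats, audio_feats)  # video attends to audio
        a_ctx, _ = self.a2v(audio_feats, video_feats, video_feats)  # audio attends to video
        pooled = torch.cat([v_ctx.mean(dim=1), a_ctx.mean(dim=1)], dim=-1)
        return self.classifier(pooled)  # multi-label logits

logits = AttendFusionSketch()(torch.randn(2, 30, 256), torch.randn(2, 40, 256))
print(logits.shape)  # torch.Size([2, 1000])
```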
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performances on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales. The code is available at this https URL.
https://arxiv.org/abs/2408.09408
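The "predicate detection as 1D instance segmentation" idea can be illustrated with a toy head that fuses per-frame subject and object features and predicts per-frame predicate activations; the fusion layer standing in for the SOS module, the 1D convolution, and the predicate count are assumptions.

```python
import torch
import torch.nn as nn

class PredicateMaskHead(nn.Module):
    """Hedged sketch: fuse per-frame subject/object features of one entity pair and
    predict, frame by frame, which predicate classes are active (a 1D 'mask' per class)."""
    def __init__(self, dim: int = 256, num_predicates: int = 50):  # predicate count is a placeholder
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)            # simple stand-in for the SOS module
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.mask_head = nn.Conv1d(dim, num_predicates, kernel_size=1)

    def forward(self, subj_feats, obj_feats):
        # subj_feats, obj_feats: (B, T, dim) per-frame features of the subject and the object
        x = torch.relu(self.fuse(torch.cat([subj_feats, obj_feats], dim=-1)))  # (B, T, dim)
        x = torch.relu(self.temporal(x.transpose(1, 2)))                       # (B, dim, T)
        return self.mask_head(x).transpose(1, 2)   # (B, T, P) per-frame predicate logits

masks = PredicateMaskHead()(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(masks.shape)  # torch.Size([2, 64, 50])
```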
Visual relationship understanding has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks. Given the complexity and interconnectedness of these tasks, it is crucial to have a flexible framework that can effectively address these tasks in a cohesive manner. In this work, we propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation, and further possesses the capability for open-vocabulary segmentation to adapt to novel scenarios. FleVRS leverages the synergy between text and image modalities to ground various types of relationships from images and uses textual features from vision-language models to guide visual conceptual understanding. Empirical validation across various datasets demonstrates that our framework outperforms existing models in standard, promptable, and open-vocabulary tasks, e.g., +1.9 $mAP$ on HICO-DET, +11.4 $Acc$ on VRD, +4.7 $mAP$ on unseen HICO-DET. Our FleVRS represents a significant step towards a more intuitive, comprehensive, and scalable understanding of visual relationships.
https://arxiv.org/abs/2408.08305
Recent Audio-Visual Question Answering (AVQA) methods rely on complete visual and audio input to answer questions accurately. However, in real-world scenarios, issues such as device malfunctions and data transmission errors frequently result in missing audio or visual modality. In such cases, existing AVQA methods suffer significant performance degradation. In this paper, we propose a framework that ensures robust AVQA performance even when a modality is missing. First, we propose a Relation-aware Missing Modal (RMM) generator with Relation-aware Missing Modal Recalling (RMMR) loss to enhance the ability of the generator to recall missing modal information by understanding the relationships and context among the available modalities. Second, we design an Audio-Visual Relation-aware (AVR) diffusion model with Audio-Visual Enhancing (AVE) loss to further enhance audio-visual features by leveraging the relationships and shared cues between the audio-visual modalities. As a result, our method can provide accurate answers by effectively utilizing available information even when input modalities are missing. We believe our method holds potential applications not only in AVQA research but also in various multi-modal scenarios.
https://arxiv.org/abs/2407.16171
Text-based person search (TBPS) is a problem that has gained significant interest within the research community. The task is that of retrieving one or more images of a specific individual based on a textual description. The multi-modal nature of the task requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. The first is inter-identity noise, which stems from the inherent vagueness and imprecision of text descriptions: descriptions of visual attributes can generally apply to different people. The second is intra-identity variation, i.e., nuisances such as pose and illumination that can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description. In doing so, the model is encouraged to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text. This loss ensures that every attribute is taken into consideration in the person retrieval process. Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric w.r.t. the current state of the art.
https://arxiv.org/abs/2407.04287
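The attribute-balancing idea can be illustrated with a toy loss in which every adjective-noun chunk of the paired description contributes equally to the objective; the cosine-similarity pull used here is an assumption, not MARS's actual Attribute Loss.

```python
import torch
import torch.nn.functional as F

def attribute_loss(image_emb, chunk_embs):
    """Hedged sketch of an attribute-balancing loss: each adjective-noun chunk of the
    matching description contributes equally, so frequent attributes do not dominate.
    image_emb: (B, D); chunk_embs: list of (K_i, D) tensors, one per image."""
    losses = []
    for img, chunks in zip(image_emb, chunk_embs):
        sim = F.cosine_similarity(chunks, img.unsqueeze(0), dim=-1)  # (K_i,)
        losses.append((1.0 - sim).mean())  # pull every attribute toward its image embedding
    return torch.stack(losses).mean()

# toy usage: 2 images, whose captions yielded 3 and 2 attribute chunks respectively
img = F.normalize(torch.randn(2, 64), dim=-1)
chunks = [F.normalize(torch.randn(3, 64), dim=-1), F.normalize(torch.randn(2, 64), dim=-1)]
print(attribute_loss(img, chunks))
```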
The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which is essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset's long-tail distribution significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information.
https://arxiv.org/abs/2406.16449
Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This begs the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive biases to do so: 1) a perceptual stage wherein local object features are extracted and stored in a disentangled representation, and 2) a relational stage wherein object representations are compared. In the second stage, we find evidence that ViTs can learn to represent somewhat abstract visual relations, a capability that has long been considered out of reach for artificial neural networks. Finally, we demonstrate that failure points at either stage can prevent a model from learning a generalizable solution to our fairly simple tasks. By understanding ViTs in terms of discrete processing stages, one can more precisely diagnose and rectify shortcomings of existing and future models.
https://arxiv.org/abs/2406.15955
Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines, which typically consist of object detection, temporal association, and multi-relation classification. However, these methods exhibit inherent limitations due to the separation of multiple stages, and independent optimization of these sub-problems may yield sub-optimal solutions. To remedy these limitations, we propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline. This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph. Moreover, to address another challenge of DSGG, capturing temporal dependencies, we introduce a Progressively Refined Module (PRM) for aggregating temporal context without the constraints of additional trackers or handcrafted trajectories, enabling end-to-end optimization of the network. Extensive experiments conducted on the Action Genome benchmark demonstrate the effectiveness of our design. The code and models are available at this https URL.
https://arxiv.org/abs/2405.16925
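One way to picture tracker-free temporal aggregation over pair-wise features is the hedged sketch below, where current-frame pair features cross-attend to pair features from reference frames; the residual update and dimensions are placeholders, not the PRM design.

```python
import torch
import torch.nn as nn

class TemporalPairAggregator(nn.Module):
    """Hedged sketch of tracker-free temporal aggregation: subject-object pair features
    of the current frame cross-attend to pair features pooled from reference frames,
    so temporal context is gathered without explicit trajectories."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur_pairs, ref_pairs):
        # cur_pairs: (B, Np, dim) pair features of the current frame
        # ref_pairs: (B, Nr, dim) pair features collected from reference frames
        ctx, _ = self.cross_attn(cur_pairs, ref_pairs, ref_pairs)
        return self.norm(cur_pairs + ctx)  # residual update, fed to relation classifiers

out = TemporalPairAggregator()(torch.randn(2, 100, 256), torch.randn(2, 300, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```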
Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can be often recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
https://arxiv.org/abs/2404.04565
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences. By doing so, we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.
https://arxiv.org/abs/2404.00906
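The image-to-graph step reduces to parsing the text the VLM generates into triples; a minimal sketch is shown below, where the "(subject, predicate, object)" clause format is an assumed prompt convention rather than the paper's actual sequence grammar.

```python
import re

def parse_scene_graph(sequence: str):
    """Hedged sketch of the sequence-to-scene-graph step: the VLM is assumed to emit
    relations as '(subject, predicate, object)' clauses, which are parsed into triples."""
    triples = []
    for subj, pred, obj in re.findall(r"\(([^,()]+),([^,()]+),([^,()]+)\)", sequence):
        triples.append((subj.strip(), pred.strip(), obj.strip()))
    nodes = sorted({t[0] for t in triples} | {t[2] for t in triples})
    return {"nodes": nodes, "edges": triples}

generated = "(man, riding, horse) (horse, standing on, grass) (man, wearing, hat)"
print(parse_scene_graph(generated))
# {'nodes': ['grass', 'hat', 'horse', 'man'], 'edges': [('man', 'riding', 'horse'), ...]}
```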
Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often struggle with accurately depicting visual relationships between objects. This issue, as we uncover through careful analysis, arises from a misaligned text encoder that struggles to interpret specific relationships and differentiate the logical order of associated objects. To resolve this, we introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this, we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). It models the directional relationships between relation terms and corresponding objects within the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space. Crucially, our method retains the parameters of the text encoder and diffusion model, preserving the model's robust performance on unrelated descriptions. We validated our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative enhancements in generating images with precise visual relations. Project page: this https URL.
https://arxiv.org/abs/2403.20249
Visual Relationship Detection (VRD) has seen significant advancements with Transformer-based architectures recently. However, we identify two key limitations in a conventional label assignment for training Transformer-based VRD models, which is a process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment, an unspecialized query is trained since a query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Furthermore, a query is also insufficiently trained since a GT is assigned only to a single prediction, therefore near-correct or even correct predictions are suppressed by being assigned no relation as a GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a specialized query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates the training by assigning a GT to multiple predictions that are significantly close to a GT in terms of a subject, an object, and the relation in between. Experimental results and analyses show that SpeaQ effectively trains specialized queries, which better utilize the capacity of a model, resulting in consistent performance gains with zero additional inference cost across multiple VRD models and benchmarks. Code is available at this https URL.
https://arxiv.org/abs/2403.17709
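A toy version of the two assignment ideas above, restricting each query group to its relation group and assigning a ground truth to every sufficiently close prediction, is sketched below; the greedy per-GT pick, the quality threshold, and the grouping scheme are simplifications, not SpeaQ's exact matching rule.

```python
import torch

def groupwise_multi_assign(cost, quality, query_group, rel_group, tau=0.7):
    """Toy SpeaQ-style assignment (not the paper's exact rule).
    cost:    (Q, G) matching cost between queries and ground-truth relations
    quality: (Q, G) closeness score in [0, 1], e.g. subject/object IoUs times class prob
    query_group / rel_group: group id per query / per ground-truth relation
    Returns a (Q, G) boolean matrix: which GT each query is trained against."""
    Q, G = cost.shape
    same_group = query_group.unsqueeze(1) == rel_group.unsqueeze(0)      # (Q, G)
    masked_cost = cost.masked_fill(~same_group, float("inf"))
    base = torch.zeros(Q, G, dtype=torch.bool)
    base[masked_cost.argmin(dim=0), torch.arange(G)] = True  # greedy per-GT pick (stand-in for Hungarian matching)
    extra = same_group & (quality >= tau)                    # also assign near-correct queries
    return base | extra

cost, quality = torch.rand(6, 3), torch.rand(6, 3)
query_group = torch.tensor([0, 0, 0, 1, 1, 1])   # queries split into two groups
rel_group = torch.tensor([0, 1, 1])              # each GT relation mapped to a group
print(groupwise_multi_assign(cost, quality, query_group, rel_group))
```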
3D visual grounding aims to identify the target object within a 3D point cloud scene referred to by a natural language description. While previous works attempt to exploit the verbo-visual relation with proposed cross-modal transformers, unstructured natural utterances and scattered objects might lead to undesirable performances. In this paper, we introduce DOrA, a novel 3D visual grounding framework with Order-Aware referring. DOrA is designed to leverage Large Language Models (LLMs) to parse language description, suggesting a referential order of anchor objects. Such ordered anchor objects allow DOrA to update visual features and locate the target object during the grounding process. Experimental results on the NR3D and ScanRefer datasets demonstrate our superiority in both low-resource and full-data scenarios. In particular, DOrA surpasses current state-of-the-art frameworks by 9.3% and 7.8% grounding accuracy under 1% data and 10% data settings, respectively.
https://arxiv.org/abs/2403.16539
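The order-aware refinement loop can be pictured as below: object features are updated once per anchor, in the referential order the LLM suggests, before the target is scored. The attention-based update and feature sizes are assumptions, not DOrA's actual modules.

```python
import torch
import torch.nn as nn

class OrderAwareGrounding(nn.Module):
    """Hedged sketch of order-aware referring: object features are refined once per
    anchor phrase, following the referential order suggested by an LLM, before the
    final target is scored (text encoder and update rule are placeholders)."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.step = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, obj_feats, anchor_embs):
        # obj_feats: (B, N, dim) 3D object proposals; anchor_embs: (B, K, dim),
        # ordered from the first anchor mentioned to the final target description
        for k in range(anchor_embs.size(1)):
            cue = anchor_embs[:, k : k + 1, :]           # inject one anchor cue at a time
            upd, _ = self.step(obj_feats, cue, cue)
            obj_feats = obj_feats + upd                  # progressive refinement
        return self.score(obj_feats).squeeze(-1)         # (B, N) target logits

logits = OrderAwareGrounding()(torch.randn(2, 50, 128), torch.randn(2, 3, 128))
print(logits.shape)  # torch.Size([2, 50])
```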
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
https://arxiv.org/abs/2403.14270
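The pair-selection step can be sketched as a scoring of every ordered pair of encoder object tokens followed by a top-k cut; the bilinear scoring and the value of k are illustrative choices, not the paper's selection mechanism.

```python
import torch
import torch.nn as nn

class PairSelector(nn.Module):
    """Hedged sketch of decoder-free relation readout: score every ordered pair of
    object tokens from the image encoder and keep the top-k pairs; their features
    would then feed an open-vocabulary predicate classifier (not shown)."""
    def __init__(self, dim: int = 512, topk: int = 16):
        super().__init__()
        self.subj_head = nn.Linear(dim, dim)
        self.obj_head = nn.Linear(dim, dim)
        self.topk = topk

    def forward(self, tokens):
        # tokens: (B, N, dim) object tokens produced by a ViT-style encoder
        scores = self.subj_head(tokens) @ self.obj_head(tokens).transpose(1, 2)
        scores = scores / tokens.size(-1) ** 0.5                 # (B, N, N) pair logits
        topv, topi = scores.flatten(1).topk(self.topk, dim=1)    # most likely pairs
        n = tokens.size(1)
        subj_idx = torch.div(topi, n, rounding_mode="floor")
        obj_idx = topi % n
        return topv, subj_idx, obj_idx

v, s, o = PairSelector()(torch.randn(2, 100, 512))
print(v.shape, s.shape, o.shape)  # torch.Size([2, 16]) each
```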
The development of Large Vision-Language Models (LVLMs) is striving to catch up with the success of Large Language Models (LLMs), yet it faces more challenges to be resolved. Very recent works enable LVLMs to localize object-level visual contents and ground text to them. Nonetheless, current LVLMs still struggle to precisely understand visual relations due to the lack of relevant data. In this work, we present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video. Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations, temporal associations and geometric transforms. Extensive case studies and quantitative evaluations show that RelationVLM has a strong capability in understanding such relations and, by comparison, exhibits an impressive emergent in-context ability to reason from few-shot examples. This work fosters the advancements of LVLMs by enabling them to support a wider range of downstream applications toward artificial general intelligence.
https://arxiv.org/abs/2403.12801
Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. Firstly, there exists a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and represent it through language. Secondly, classifiers trained by a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets in the form of < subject, predicate, object> tuples to extract relationships from visual processing. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to single monolithic networks, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.
https://arxiv.org/abs/2403.03994
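A sparsely-gated mixture of experts for relation scoring, in rough outline, looks like the sketch below; the expert count, top-k gating, and sizes are placeholders, and a real implementation would dispatch inputs only to the selected experts rather than running them all.

```python
import torch
import torch.nn as nn

class SparseMoERelationHead(nn.Module):
    """Hedged sketch of sparsely-gated mixture of experts for relation scoring: a
    gating network picks the top-k experts per input and combines their outputs
    with renormalized gate weights (sizes are illustrative)."""
    def __init__(self, dim=256, num_predicates=50, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_predicates))
            for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, pair_feats):
        # pair_feats: (B, dim) joint <subject, object> features
        gate_logits = self.gate(pair_feats)                          # (B, E)
        topv, topi = gate_logits.topk(self.k, dim=-1)
        weights = torch.softmax(topv, dim=-1)                        # renormalize over top-k
        # note: all experts are run here for brevity; conditional computation would
        # dispatch each input only to its selected experts
        all_out = torch.stack([e(pair_feats) for e in self.experts], dim=1)  # (B, E, P)
        picked = all_out.gather(1, topi.unsqueeze(-1).expand(-1, -1, all_out.size(-1)))
        return (weights.unsqueeze(-1) * picked).sum(dim=1)           # (B, P) predicate logits

logits = SparseMoERelationHead()(torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 50])
```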
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations. Our code is available at this https URL
https://arxiv.org/abs/2402.16627
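A toy of the contextualized forward process, the standard DDPM forward sample plus a context-dependent shift that the reverse process can reuse, is sketched below; the shift network and schedule are assumptions and this is not ContextDiff's actual formulation.

```python
import torch
import torch.nn as nn

class ContextShift(nn.Module):
    """Tiny stand-in network mapping (cross-modal context, timestep) to a sample shift."""
    def __init__(self, ctx_dim: int = 32, x_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim + 1, 64), nn.ReLU(), nn.Linear(64, x_dim))

    def forward(self, ctx, t, T):
        return self.net(torch.cat([ctx, (t.float() / T).unsqueeze(-1)], dim=-1))

def contextual_forward_sample(x0, t, alphas_cumprod, shift):
    """Toy contextualized forward step: the usual DDPM forward sample plus a
    context-dependent shift; a consistent reverse step would account for the
    same shift. Illustrative only."""
    ab = alphas_cumprod[t].view(-1, 1)
    noise = torch.randn_like(x0)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise + shift, noise

T = 1000
alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)  # linear beta schedule
shift_net = ContextShift()
x0, ctx = torch.randn(4, 16), torch.randn(4, 32)   # toy "sample" and text context
t = torch.randint(0, T, (4,))
x_t, eps = contextual_forward_sample(x0, t, alphas_cumprod, shift_net(ctx, t, T))
print(x_t.shape)  # torch.Size([4, 16])
```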
Achieving visual reasoning is a long-term goal of artificial intelligence. In the last decade, several studies have applied deep neural networks (DNNs) to the task of learning visual relations from images, with modest results in terms of generalization of the relations learned. However, in recent years, object-centric representation learning has been put forward as a way to achieve visual reasoning within the deep learning framework. Object-centric models attempt to model input scenes as compositions of objects and relations between them. To this end, these models use several kinds of attention mechanisms to segregate the individual objects in a scene from the background and from other objects. In this work we tested relation learning and generalization in several object-centric models, as well as a ResNet-50 baseline. In contrast to previous research, which has focused heavily on the same-different task in order to assess relational reasoning in DNNs, we use a set of tasks -- with varying degrees of difficulty -- derived from the comparative cognition literature. Our results show that object-centric models are able to segregate the different objects in a scene, even in many out-of-distribution cases. In our simpler tasks, this improves their capacity to learn and generalize visual relations in comparison to the ResNet-50 baseline. However, object-centric models still struggle in our more difficult tasks and conditions. We conclude that abstract visual reasoning remains an open challenge for DNNs, including object-centric models.
https://arxiv.org/abs/2402.12675