Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can often be recognized from appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this task, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
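As a rough illustration of the second stage of such a two-stage baseline (this is not the authors' code; the feature extractor, dimensions, and the union-box input are assumptions), the sketch below scores every ordered pair of detected people on a keyframe against the 34 interaction classes:

```python
import torch
import torch.nn as nn

class PairwiseInteractionHead(nn.Module):
    """Classify interactions between every ordered pair of detected humans."""
    def __init__(self, feat_dim=256, num_interactions=34):
        super().__init__()
        # concatenate subject feature, object feature, and their union-box feature
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim * 3, 512), nn.ReLU(),
            nn.Linear(512, num_interactions),
        )

    def forward(self, person_feats, union_feats):
        # person_feats: (N, D) features of N detected humans on a keyframe
        # union_feats:  (N, N, D) features pooled from the union box of each pair
        n, d = person_feats.shape
        subj = person_feats.unsqueeze(1).expand(n, n, d)    # subject of the pair
        obj = person_feats.unsqueeze(0).expand(n, n, d)     # object of the pair
        pair = torch.cat([subj, obj, union_feats], dim=-1)  # (N, N, 3D)
        return self.classifier(pair)                        # (N, N, num_interactions) logits

# usage with random stand-in features for 5 detected players
head = PairwiseInteractionHead()
scores = head(torch.randn(5, 256), torch.randn(5, 5, 256))
print(scores.shape)  # torch.Size([5, 5, 34])
```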
https://arxiv.org/abs/2404.04565
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences. By doing so, we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.
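To make the image-to-graph idea concrete, here is a minimal sketch of the last step, turning a generated sequence into a graph, assuming a simple "subject, predicate, object;" serialization (the paper's actual scene-graph sequence format may differ):

```python
def parse_scene_graph(sequence: str):
    """Turn a VLM-generated triplet sequence into nodes and relation edges."""
    nodes, edges = set(), []
    for chunk in sequence.split(";"):
        parts = [p.strip() for p in chunk.split(",")]
        if len(parts) != 3 or not all(parts):
            continue  # skip malformed fragments from free-form generation
        subj, pred, obj = parts
        nodes.update([subj, obj])
        edges.append((subj, pred, obj))
    return sorted(nodes), edges

nodes, edges = parse_scene_graph("man, riding, horse; horse, on, beach;")
print(nodes)   # ['beach', 'horse', 'man']
print(edges)   # [('man', 'riding', 'horse'), ('horse', 'on', 'beach')]
```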
https://arxiv.org/abs/2404.00906
Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often struggle with accurately depicting visual relationships between objects. This issue, as we uncover through careful analysis, arises from a misaligned text encoder that struggles to interpret specific relationships and differentiate the logical order of associated objects. To resolve this, we introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this, we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). It models the directional relationships between relation terms and corresponding objects within the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space. Crucially, our method retains the parameters of the text encoder and diffusion model, preserving the model's robust performance on unrelated descriptions. We validated our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative enhancements in generating images with precise visual relations. Project page: this https URL.
https://arxiv.org/abs/2403.20249
Visual Relationship Detection (VRD) has seen significant advancements with Transformer-based architectures recently. However, we identify two key limitations in a conventional label assignment for training Transformer-based VRD models, which is a process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment, an unspecialized query is trained since a query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Furthermore, a query is also insufficiently trained since a GT is assigned only to a single prediction, therefore near-correct or even correct predictions are suppressed by being assigned no relation as a GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a specialized query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates the training by assigning a GT to multiple predictions that are significantly close to a GT in terms of a subject, an object, and the relation in between. Experimental results and analyses show that SpeaQ effectively trains specialized queries, which better utilize the capacity of a model, resulting in consistent performance gains with zero additional inference cost across multiple VRD models and benchmarks. Code is available at this https URL.
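The sketch below conveys the two ideas in spirit only (it is not the released SpeaQ code, and the quality measure and threshold are assumptions): the matching cost is masked so a query group can only be assigned relations from its own predicate group, and every prediction whose match quality clears a threshold also receives the ground truth:

```python
import numpy as np

def groupwise_multi_assign(cost, quality, query_group, gt_group, tau=0.7):
    # cost:    (num_queries, num_gt) matching cost, lower is better
    # quality: (num_queries, num_gt) match quality in [0, 1] combining subject,
    #          object, and predicate agreement, higher is better
    # query_group / gt_group: group id per query / per ground-truth relation
    masked = cost.copy()
    masked[query_group[:, None] != gt_group[None, :]] = np.inf  # groupwise specialization
    assignments = []
    for j in range(cost.shape[1]):
        best = int(np.argmin(masked[:, j]))                 # the usual one-to-one match
        extra = np.where((quality[:, j] >= tau) &
                         (query_group == gt_group[j]))[0]   # near-correct predictions
        assignments.append(sorted({best, *extra.tolist()}))
    return assignments  # query indices supervised by each ground-truth relation

cost, quality = np.random.rand(6, 2), np.random.rand(6, 2)
print(groupwise_multi_assign(cost, quality,
                             query_group=np.array([0, 0, 0, 1, 1, 1]),
                             gt_group=np.array([0, 1])))
```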
https://arxiv.org/abs/2403.17709
3D visual grounding aims to identify the target object within a 3D point cloud scene referred to by a natural language description. While previous works attempt to exploit the verbo-visual relation with proposed cross-modal transformers, unstructured natural utterances and scattered objects might lead to undesirable performances. In this paper, we introduce DOrA, a novel 3D visual grounding framework with Order-Aware referring. DOrA is designed to leverage Large Language Models (LLMs) to parse language description, suggesting a referential order of anchor objects. Such ordered anchor objects allow DOrA to update visual features and locate the target object during the grounding process. Experimental results on the NR3D and ScanRefer datasets demonstrate our superiority in both low-resource and full-data scenarios. In particular, DOrA surpasses current state-of-the-art frameworks by 9.3% and 7.8% grounding accuracy under 1% data and 10% data settings, respectively.
https://arxiv.org/abs/2403.16539
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
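A hedged sketch of the pair-selection step described above (the scoring head, dimensions, and top-k budget are illustrative assumptions): object tokens from the image encoder are scored pairwise with a light attention-style head, and only the highest-scoring pairs are kept as relation candidates:

```python
import torch
import torch.nn as nn

class PairSelector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # subject-side projection
        self.key = nn.Linear(dim, dim)    # object-side projection

    def forward(self, tokens, top_k=16):
        # tokens: (num_objects, dim) object tokens from the Transformer image encoder
        scores = self.query(tokens) @ self.key(tokens).t() / tokens.size(-1) ** 0.5
        scores.fill_diagonal_(float("-inf"))        # no self-relations
        flat_idx = scores.flatten().topk(top_k).indices
        n = tokens.size(0)
        subj = torch.div(flat_idx, n, rounding_mode="floor")
        return torch.stack([subj, flat_idx % n], dim=1)  # (top_k, 2) subject/object indices

selector = PairSelector()
pairs = selector(torch.randn(20, 256))
print(pairs.shape)  # torch.Size([16, 2])
```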
https://arxiv.org/abs/2403.14270
The development of Large Vision-Language Models (LVLMs) is striving to catch up with the success of Large Language Models (LLMs), yet it faces more challenges that remain to be resolved. Very recent works enable LVLMs to localize object-level visual contents and ground text to them. Nonetheless, current LVLMs still struggle to precisely understand visual relations due to the lack of relevant data. In this work, we present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video. Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations, temporal associations and geometric transforms. Extensive case studies and quantitative evaluations show RelationVLM has strong capability in understanding such relations and exhibits an impressive in-context capability of reasoning from few-shot examples. This work fosters the advancements of LVLMs by enabling them to support a wider range of downstream applications toward artificial general intelligence.
https://arxiv.org/abs/2403.12801
Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. Firstly, there exists a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and represent it through language. Secondly, classifiers trained by a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets in the form of <subject, predicate, object> tuples to extract relationships from visual processing. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to single monolithic networks, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.
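A generic sketch of the sparsely-gated mixture-of-experts aggregation the abstract describes (layer sizes, the number of experts, and the choice of k are illustrative assumptions; this is not the MoE-VRD implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, in_dim=512, out_dim=128, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
             for _ in range(num_experts)])
        self.gate = nn.Linear(in_dim, num_experts)
        self.k = k

    def forward(self, x):
        # x: (batch, in_dim) pooled features of a candidate <subject, object> pair
        logits = self.gate(x)                              # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # route to only k experts
        weights = F.softmax(topk_vals, dim=-1)             # renormalize over the k experts
        out = x.new_zeros(x.size(0), self.experts[0][-1].out_features)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                routed = topk_idx[:, slot] == e            # rows routed to expert e
                if routed.any():
                    out[routed] += weights[routed, slot, None] * expert(x[routed])
        return out

moe = SparseMoE()
print(moe(torch.randn(4, 512)).shape)  # torch.Size([4, 128])
```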
https://arxiv.org/abs/2403.03994
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations. Our code is available at this https URL
https://arxiv.org/abs/2402.16627
Achieving visual reasoning is a long-term goal of artificial intelligence. In the last decade, several studies have applied deep neural networks (DNNs) to the task of learning visual relations from images, with modest results in terms of generalization of the relations learned. However, in recent years, object-centric representation learning has been put forward as a way to achieve visual reasoning within the deep learning framework. Object-centric models attempt to model input scenes as compositions of objects and relations between them. To this end, these models use several kinds of attention mechanisms to segregate the individual objects in a scene from the background and from other objects. In this work we tested relation learning and generalization in several object-centric models, as well as a ResNet-50 baseline. In contrast to previous research, which has focused heavily on the same-different task in order to assess relational reasoning in DNNs, we use a set of tasks -- with varying degrees of difficulty -- derived from the comparative cognition literature. Our results show that object-centric models are able to segregate the different objects in a scene, even in many out-of-distribution cases. In our simpler tasks, this improves their capacity to learn and generalize visual relations in comparison to the ResNet-50 baseline. However, object-centric models still struggle in our more difficult tasks and conditions. We conclude that abstract visual reasoning remains an open challenge for DNNs, including object-centric models.
https://arxiv.org/abs/2402.12675
The challenge in learning abstract concepts from images in an unsupervised fashion lies in the required integration of visual perception and generalizable relational reasoning. Moreover, the unsupervised nature of this task makes it necessary for human users to be able to understand a model's learnt concepts and potentially revise false behaviours. To tackle both the generalizability and interpretability constraints of visual concept learning, we propose Pix2Code, a framework that extends program synthesis to visual relational reasoning by utilizing the abilities of both explicit, compositional symbolic and implicit neural representations. This is achieved by retrieving object representations from images and synthesizing relational concepts as lambda-calculus programs. We evaluate the diverse properties of Pix2Code on the challenging reasoning domains, Kandinsky Patterns and CURI, thereby testing its ability to identify compositional visual concepts that generalize to novel data and concept configurations. Particularly, in stark contrast to neural approaches, we show that Pix2Code's representations remain human interpretable and can be easily revised for improved performance.
https://arxiv.org/abs/2402.08280
Scene graph generation (SGG) endeavors to predict visual relationships between pairs of objects within an image. Prevailing SGG methods traditionally assume a one-off learning process for SGG. This conventional paradigm may necessitate repetitive training on all previously observed samples whenever new relationships emerge, in order to mitigate the risk of forgetting previously acquired knowledge. This work seeks to address this pitfall inherent in a suite of prior relationship predictions. Motivated by the achievements of in-context learning in pretrained language models, our approach imbues the model with the capability to predict relationships and continuously acquire novel knowledge without succumbing to catastrophic forgetting. To achieve this goal, we introduce a novel and pragmatic framework for scene graph generation, namely Lifelong Scene Graph Generation (LSGG), where tasks, such as predicates, unfold in a streaming fashion. In this framework, the model is constrained to exclusive training on the present task, devoid of access to previously encountered training data, except for a limited number of exemplars, but the model is tasked with inferring all predicates it has encountered thus far. Rigorous experiments demonstrate the superiority of our proposed method over state-of-the-art SGG models in the context of LSGG across a diverse array of metrics. Besides, extensive experiments on the two mainstream benchmark datasets, VG and Open-Image(v6), show the superiority of our proposed model to a number of competitive SGG models in terms of continuous learning and conventional settings. Moreover, comprehensive ablation experiments demonstrate the effectiveness of each component in our model.
https://arxiv.org/abs/2401.14626
Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks within the complexity of natural scenes. Previous methodologies have treated CTBD as either a visual relation extraction challenge within computer vision or as a sequence modeling problem from the perspective of natural language processing. We introduce a new framework that frames CTBD as a graph generation problem. This methodology consists of two essential procedures: identifying individual text units as graph nodes and discerning the sequential reading order relationships among these units as graph edges. Leveraging the cutting-edge capabilities of DQ-DETR for node detection, our framework innovates further by integrating a novel mechanism, a Dynamic Relation Transformer (DRFormer), dedicated to edge generation. DRFormer incorporates a dual interactive transformer decoder that deftly manages a dynamic graph structure refinement process. Through this iterative process, the model systematically enhances the graph's fidelity, ultimately resulting in improved precision in detecting contextual text blocks. Comprehensive experimental evaluations conducted on both SCUT-CTW-Context and ReCTS-Context datasets substantiate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our graph generation framework in advancing the field of CTBD.
https://arxiv.org/abs/2401.09232
The task of Visual Relationship Recognition (VRR) aims to identify relationships between two interacting objects in an image and is particularly challenging due to the widely-spread and highly imbalanced distribution of <subject, relation, object> triplets. To overcome the resultant performance bias in existing VRR approaches, we introduce DiffAugment -- a method which first augments the tail classes in the linguistic space by making use of WordNet and then utilizes the generative prowess of Diffusion Models to expand the visual space for minority classes. We propose a novel hardness-aware component in diffusion which is based upon the hardness of each <S,R,O> triplet and demonstrate the effectiveness of hardness-aware diffusion in generating visual embeddings for the tail classes. We also propose a novel subject and object based seeding strategy for diffusion sampling which improves the discriminative capability of the generated visual embeddings. Extensive experimentation on the GQA-LT dataset shows favorable gains in the subject/object and relation average per-class accuracy using Diffusion augmented samples.
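As a small illustration of the hardness-aware idea (the hardness proxy below, inverse class frequency, and the budget are assumptions, not the paper's exact definition), tail <S, R, O> classes receive proportionally more diffusion-generated visual embeddings:

```python
import numpy as np

def allocate_synthetic_samples(class_counts, budget=1000):
    """Split an augmentation budget across triplet classes by a hardness proxy."""
    counts = np.asarray(class_counts, dtype=float)
    hardness = 1.0 / counts                  # rarer triplet classes count as harder
    weights = hardness / hardness.sum()
    return np.round(weights * budget).astype(int)

# three head classes followed by three tail classes
print(allocate_synthetic_samples([5000, 3000, 1000, 50, 20, 10]))
# the tail classes receive the bulk of the augmentation budget
```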
https://arxiv.org/abs/2401.01387
Visual relations are complex, multimodal concepts that play an important role in the way humans perceive the world. As a result of their complexity, high-quality, diverse and large scale datasets for visual relations are still absent. In an attempt to overcome this data barrier, we choose to focus on the problem of few-shot Visual Relationship Detection (VRD), a setting that has been so far neglected by the community. In this work we present the first pretraining method for few-shot predicate classification that does not require any annotated relations. We achieve this by introducing a generative model that is able to capture the variation of semantic, visual and spatial information of relations inside a latent space and later exploiting its representations in order to achieve efficient few-shot classification. We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets where our model outperforms the baselines. Lastly we attempt to interpret the decisions of the model by conducting various qualitative experiments.
https://arxiv.org/abs/2311.16261
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD). Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR), a variation of MIM where a percentage of the entities/objects within a scene are masked and subsequently reconstructed based on the unmasked objects. The core idea is that, through object-level masked modeling, the network learns context-aware representations that capture the interaction of objects within a scene and thus are highly predictive of visual object relationships. We extensively evaluate learned representations, both qualitatively and quantitatively, in a few-shot setting and demonstrate the efficacy of MBBR for learning robust visual representations, particularly tailored for VRD. The proposed method is able to surpass state-of-the-art VRD methods on the Predicate Detection (PredDet) evaluation setting, using only a few annotated samples. We make our code available at this https URL.
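A minimal sketch of object-level masked modelling in the spirit of MBBR (the architecture, masking ratio, and loss are assumptions): a fraction of per-object features is replaced by a learned mask token, a transformer encoder re-encodes the scene, and reconstruction is supervised only on the masked objects:

```python
import torch
import torch.nn as nn

class MaskedBoxReconstruction(nn.Module):
    def __init__(self, dim=256, mask_ratio=0.3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_ratio = mask_ratio

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, dim) features of the objects in each scene
        b, n, d = obj_feats.shape
        mask = torch.rand(b, n, device=obj_feats.device) < self.mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(b, n, d), obj_feats)
        recon = self.encoder(corrupted)                 # reconstruct from visible objects
        loss = ((recon - obj_feats) ** 2)[mask].mean()  # penalize only masked objects
        return loss, recon

model = MaskedBoxReconstruction()
loss, _ = model(torch.randn(2, 12, 256))
print(float(loss))
```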
https://arxiv.org/abs/2311.04834
Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks aiming at localising and recognising relationships between objects, and interactions between humans and objects, respectively. Prevailing works treat these tasks as distinct tasks, leading to the development of task-specific models tailored to individual datasets. However, we posit that the presence of visual relationships can furnish crucial contextual and intricate relational cues that significantly augment the inference of human-object interactions. This motivates us to think if there is a natural intrinsic relationship between the two tasks, where scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Concretely, we initiate a relation Transformer tasked with generating relation triples from a suite of visual features. Subsequently, we employ another transformer-based decoder to predict human-object interactions based on the generated relation triples. A comprehensive series of experiments conducted across established benchmark datasets including Visual Genome, V-COCO, and HICO-DET demonstrates the compelling performance of our SG2HOI+ model in comparison to prevalent one-stage SGG models. Remarkably, our approach achieves competitive performance when compared to state-of-the-art HOI methods. Additionally, we observe that our SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner yields substantial improvements for both tasks compared to individualized training paradigms.
https://arxiv.org/abs/2311.01755
Dynamic scene graph generation (SGG) from videos requires not only a comprehensive understanding of objects across scenes that are prone to temporal fluctuations, but also modeling of the temporal motions and interactions between different objects. Moreover, the long-tailed distribution of visual relationships is the crucial bottleneck of most dynamic SGG methods, since most of them focus on capturing spatio-temporal context using complex architectures, which leads to the generation of biased scene graphs. To address these challenges, we propose FloCoDe: Flow-aware temporal consistency and Correlation Debiasing with uncertainty attenuation for unbiased dynamic scene graphs. FloCoDe employs feature warping using flow to detect temporally consistent objects across the frames. In addition, it uses correlation debiasing to learn the unbiased relation representation for long-tailed classes. Moreover, to attenuate the predictive uncertainties, it uses a mixture of sigmoidal cross-entropy loss and contrastive loss to incorporate label correlations to identify the commonly co-occurring relations and help debias the long-tailed ones. Extensive experimental evaluation shows a performance gain as high as 4.1%, demonstrating the superiority of generating more unbiased scene graphs.
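A small sketch of flow-based feature warping of the kind the abstract mentions (tensor layout, flow direction, and normalization are assumptions): features from a previous frame are resampled along an optical-flow field so that they align with the current frame before temporal consistency is enforced:

```python
import torch
import torch.nn.functional as F

def warp_features(feat_prev, flow):
    # feat_prev: (B, C, H, W) feature map of the previous frame
    # flow:      (B, 2, H, W) per-pixel displacement (dx, dy) toward the current frame
    b, _, h, w = feat_prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(feat_prev.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                  # sampling positions
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0                # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack([coords_x, coords_y], dim=-1)            # (B, H, W, 2)
    return F.grid_sample(feat_prev, sample_grid, align_corners=True)

warped = warp_features(torch.randn(1, 256, 32, 32), torch.zeros(1, 2, 32, 32))
print(warped.shape)  # torch.Size([1, 256, 32, 32])
```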
https://arxiv.org/abs/2310.16073
Although deep neural networks can achieve human-level performance on many object recognition benchmarks, prior work suggests that these same models fail to learn simple abstract relations, such as determining whether two objects are the same or different. Much of this prior work focuses on training convolutional neural networks to classify images of two same or two different abstract shapes, testing generalization on within-distribution stimuli. In this article, we comprehensively study whether deep neural networks can acquire and generalize same-different relations both within and out-of-distribution using a variety of architectures, forms of pretraining, and fine-tuning datasets. We find that certain pretrained transformers can learn a same-different relation that generalizes with near perfect accuracy to out-of-distribution stimuli. Furthermore, we find that fine-tuning on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization. Our results suggest that, with the right approach, deep neural networks can learn generalizable same-different visual relations.
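A minimal sketch of the recipe the abstract studies (the backbone choice, head size, and hyperparameters are illustrative assumptions): a pretrained vision transformer is fine-tuned with a binary head to decide whether the two shapes in an image are the same or different:

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# load an ImageNet-pretrained ViT (downloads weights) and attach a same/different head
backbone = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
backbone.heads = nn.Linear(768, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-5)

images = torch.randn(8, 3, 224, 224)    # stand-in batch of two-shape images
labels = torch.randint(0, 2, (8,))      # 1 = same, 0 = different
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```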
https://arxiv.org/abs/2310.09612
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. To be specific, a large number of detected objects and optical character recognition (OCR) tokens result in rich visual relationships. Existing works take all visual relationships into account for answer prediction. However, there are three observations: (1) a single subject in the images can be easily detected as multiple objects with distinct bounding boxes (considered repetitive objects). The associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in the image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may be indicative of important visual cues for predicting answers. Rather than utilizing all of them for answer prediction, we make an effort to identify the most important connections or eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU for spatially aware pruning. We consider three visual relationships for graph learning: object-object, OCR-OCR tokens, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations in the correlated object-token sparse graph, and then in the respective object-based sparse graph and token-based sparse graph. Experiment results on TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performances. And some visualization results further demonstrate the interpretability of our method.
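As a hedged illustration of the spatial pruning step (only DIoU is shown, and the threshold is an assumption; the full method also uses spatial distance, geometric dimension, and overlap area), pairs of boxes that are too far apart are removed from the sparse graph:

```python
import numpy as np

def diou(a, b):
    # a, b: boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)
    cx_a, cy_a = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2          # box centres
    cx_b, cy_b = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    center_dist = (cx_a - cx_b) ** 2 + (cy_a - cy_b) ** 2
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])                # enclosing box
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return iou - center_dist / diag                            # DIoU = IoU - d^2 / c^2

def prune_edges(boxes, threshold=-0.3):
    """Keep only object/token pairs that are spatially close enough to matter."""
    keep = []
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i != j and diou(boxes[i], boxes[j]) >= threshold:
                keep.append((i, j))
    return keep

boxes = np.array([[0, 0, 10, 10], [5, 5, 15, 15], [200, 200, 210, 210]])
print(prune_edges(boxes))  # the distant third box is pruned from every pair
```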
https://arxiv.org/abs/2310.09147