Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. Firstly, there exists a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and represent it through language. Secondly, classifiers trained by a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets of the form <subject, predicate, object> to extract relationships from visual processing. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to single monolithic networks, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.
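As a rough illustration of the sparsely-gated mixture-of-experts mechanism this abstract relies on, the minimal PyTorch sketch below routes each relationship feature to its top-k experts and aggregates their weighted outputs. The expert architecture, feature dimensions, number of experts, and k are placeholder assumptions, not the MoE-VRD configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal sparsely-gated mixture of experts: only the top-k experts
    chosen by the gate are evaluated for each input feature."""

    def __init__(self, in_dim, out_dim, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                           nn.Linear(out_dim, out_dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(in_dim, num_experts)
        self.k = k

    def forward(self, x):                       # x: (batch, in_dim)
        logits = self.gate(x)                   # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)  # renormalise over the chosen experts
        out = 0.0
        for slot in range(self.k):
            idx = topk_idx[:, slot]             # expert id per sample
            w = weights[:, slot].unsqueeze(-1)
            # evaluate each selected expert only on the samples routed to it
            expert_out = torch.stack(
                [self.experts[int(i)](x[b]) for b, i in enumerate(idx)])
            out = out + w * expert_out
        return out

# toy usage: relationship features in, aggregated expert prediction out
feats = torch.randn(4, 256)
moe = SparseMoE(in_dim=256, out_dim=128)
print(moe(feats).shape)   # torch.Size([4, 128])
```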
https://arxiv.org/abs/2403.03994
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations. Our code is available at this https URL
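One way to picture the "context in the forward process" idea is as a context-dependent shift added to the standard DDPM forward sample. The sketch below is schematic only: the `context_shift` term, how it is produced from the text-visual pair, and its timestep scaling are assumptions, not the paper's exact formulation.

```python
import torch

def forward_sample(x0, t, alphas_cumprod, context_shift):
    """Schematic DDPM-style forward sample with a cross-modal context shift.

    x0:             clean sample, (batch, ...)
    t:              integer timesteps, (batch,)
    alphas_cumprod: precomputed cumulative alpha schedule, (T,)
    context_shift:  per-sample bias derived from the text condition,
                    already scaled for timestep t (illustrative assumption)
    """
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    # standard DDPM mean/variance, plus the context-dependent shift
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise + context_shift, noise
```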
https://arxiv.org/abs/2402.16627
Achieving visual reasoning is a long-term goal of artificial intelligence. In the last decade, several studies have applied deep neural networks (DNNs) to the task of learning visual relations from images, with modest results in terms of generalization of the relations learned. However, in recent years, object-centric representation learning has been put forward as a way to achieve visual reasoning within the deep learning framework. Object-centric models attempt to model input scenes as compositions of objects and relations between them. To this end, these models use several kinds of attention mechanisms to segregate the individual objects in a scene from the background and from other objects. In this work we tested relation learning and generalization in several object-centric models, as well as a ResNet-50 baseline. In contrast to previous research, which has focused heavily on the same-different task in order to assess relational reasoning in DNNs, we use a set of tasks -- with varying degrees of difficulty -- derived from the comparative cognition literature. Our results show that object-centric models are able to segregate the different objects in a scene, even in many out-of-distribution cases. In our simpler tasks, this improves their capacity to learn and generalize visual relations in comparison to the ResNet-50 baseline. However, object-centric models still struggle in our more difficult tasks and conditions. We conclude that abstract visual reasoning remains an open challenge for DNNs, including object-centric models.
https://arxiv.org/abs/2402.12675
The challenge in learning abstract concepts from images in an unsupervised fashion lies in the required integration of visual perception and generalizable relational reasoning. Moreover, the unsupervised nature of this task makes it necessary for human users to be able to understand a model's learnt concepts and potentially revise false behaviours. To tackle both the generalizability and interpretability constraints of visual concept learning, we propose Pix2Code, a framework that extends program synthesis to visual relational reasoning by utilizing the abilities of both explicit, compositional symbolic and implicit neural representations. This is achieved by retrieving object representations from images and synthesizing relational concepts as lambda-calculus programs. We evaluate the diverse properties of Pix2Code on the challenging reasoning domains, Kandinsky Patterns and CURI, thereby testing its ability to identify compositional visual concepts that generalize to novel data and concept configurations. Particularly, in stark contrast to neural approaches, we show that Pix2Code's representations remain human interpretable and can be easily revised for improved performance.
https://arxiv.org/abs/2402.08280
Scene graph generation (SGG) endeavors to predict visual relationships between pairs of objects within an image. Prevailing SGG methods traditionally assume a one-off learning process for SGG. This conventional paradigm may necessitate repetitive training on all previously observed samples whenever new relationships emerge, in order to mitigate the risk of forgetting previously acquired knowledge. This work seeks to address this pitfall inherent in prior relationship-prediction approaches. Motivated by the achievements of in-context learning in pretrained language models, our approach imbues the model with the capability to predict relationships and continuously acquire novel knowledge without succumbing to catastrophic forgetting. To achieve this goal, we introduce a novel and pragmatic framework for scene graph generation, namely Lifelong Scene Graph Generation (LSGG), where tasks, such as predicates, unfold in a streaming fashion. In this framework, the model is constrained to exclusive training on the present task, devoid of access to previously encountered training data, except for a limited number of exemplars, but the model is tasked with inferring all predicates it has encountered thus far. Rigorous experiments demonstrate the superiority of our proposed method over state-of-the-art SGG models in the context of LSGG across a diverse array of metrics. In addition, extensive experiments on the two mainstream benchmark datasets, VG and Open-Image(v6), show the superiority of our proposed model over a number of competitive SGG models in both continual learning and conventional settings. Moreover, comprehensive ablation experiments demonstrate the effectiveness of each component in our model.
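A minimal sketch of the streaming regime described here: at each step the learner sees only the current task's samples plus a small exemplar buffer, yet is evaluated on every predicate seen so far. The task dictionary format, the `fit_fn`/`eval_fn` trainer interface, and the buffer size are illustrative assumptions rather than details taken from the paper.

```python
import random

def lifelong_train(task_stream, fit_fn, eval_fn, exemplars_per_task=20):
    """Streaming (lifelong) training loop with a small exemplar buffer."""
    buffer, seen = [], set()
    for task in task_stream:                 # task = {"predicates": [...], "samples": [...]}
        seen.update(task["predicates"])
        fit_fn(task["samples"] + buffer)     # no access to the full past training data
        buffer += random.sample(task["samples"],
                                min(exemplars_per_task, len(task["samples"])))
        eval_fn(sorted(seen))                # always test on all predicates seen so far
```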
https://arxiv.org/abs/2401.14626
Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks within the complexity of natural scenes. Previous methodologies have treated CTBD as either a visual relation extraction challenge within computer vision or as a sequence modeling problem from the perspective of natural language processing. We introduce a new framework that frames CTBD as a graph generation problem. This methodology consists of two essential procedures: identifying individual text units as graph nodes and discerning the sequential reading order relationships among these units as graph edges. Leveraging the cutting-edge capabilities of DQ-DETR for node detection, our framework innovates further by integrating a novel mechanism, a Dynamic Relation Transformer (DRFormer), dedicated to edge generation. DRFormer incorporates a dual interactive transformer decoder that deftly manages a dynamic graph structure refinement process. Through this iterative process, the model systematically enhances the graph's fidelity, ultimately resulting in improved precision in detecting contextual text blocks. Comprehensive experimental evaluations conducted on both SCUT-CTW-Context and ReCTS-Context datasets substantiate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our graph generation framework in advancing the field of CTBD.
https://arxiv.org/abs/2401.09232
The task of Visual Relationship Recognition (VRR) aims to identify relationships between two interacting objects in an image and is particularly challenging due to the widely-spread and highly imbalanced distribution of <subject, relation, object> triplets. To overcome the resultant performance bias in existing VRR approaches, we introduce DiffAugment -- a method which first augments the tail classes in the linguistic space by making use of WordNet and then utilizes the generative prowess of Diffusion Models to expand the visual space for minority classes. We propose a novel hardness-aware component in diffusion which is based upon the hardness of each <S,R,O> triplet and demonstrate the effectiveness of hardness-aware diffusion in generating visual embeddings for the tail classes. We also propose a novel subject and object based seeding strategy for diffusion sampling which improves the discriminative capability of the generated visual embeddings. Extensive experimentation on the GQA-LT dataset shows favorable gains in the subject/object and relation average per-class accuracy using Diffusion augmented samples.
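The WordNet-based expansion of tail classes in the linguistic space might look roughly like the following sketch. The use of noun synsets, synonyms, and one level of hypernyms via nltk is an illustrative assumption, not the paper's exact recipe.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def wordnet_expansions(class_name, max_terms=5):
    """Collect synonym and hypernym lemmas for a (tail) class name,
    giving extra linguistic variants to augment rare <S,R,O> triplets."""
    terms = set()
    for synset in wn.synsets(class_name, pos=wn.NOUN):
        terms.update(l.replace("_", " ") for l in synset.lemma_names())
        for hyper in synset.hypernyms():
            terms.update(l.replace("_", " ") for l in hyper.lemma_names())
    terms.discard(class_name)
    return sorted(terms)[:max_terms]

print(wordnet_expansions("pelican"))   # linguistic variants for a rare class
```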
https://arxiv.org/abs/2401.01387
Visual relations are complex, multimodal concepts that play an important role in the way humans perceive the world. As a result of their complexity, high-quality, diverse and large-scale datasets for visual relations are still absent. In an attempt to overcome this data barrier, we choose to focus on the problem of few-shot Visual Relationship Detection (VRD), a setting that has so far been neglected by the community. In this work we present the first pretraining method for few-shot predicate classification that does not require any annotated relations. We achieve this by introducing a generative model that is able to capture the variation of semantic, visual and spatial information of relations inside a latent space and later exploiting its representations in order to achieve efficient few-shot classification. We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets where our model outperforms the baselines. Lastly we attempt to interpret the decisions of the model by conducting various qualitative experiments.
https://arxiv.org/abs/2311.16261
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD). Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR), a variation of MIM where a percentage of the entities/objects within a scene are masked and subsequently reconstructed based on the unmasked objects. The core idea is that, through object-level masked modeling, the network learns context-aware representations that capture the interaction of objects within a scene and thus are highly predictive of visual object relationships. We extensively evaluate learned representations, both qualitatively and quantitatively, in a few-shot setting and demonstrate the efficacy of MBBR for learning robust visual representations, particularly tailored for VRD. The proposed method is able to surpass state-of-the-art VRD methods on the Predicate Detection (PredDet) evaluation setting, using only a few annotated samples. We make our code available at this https URL.
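A rough sketch of the object-level masked modelling idea: mask a fraction of per-object (bounding-box) features and reconstruct them from the unmasked objects in the same scene. The transformer encoder, mask ratio, and squared-error loss are placeholder assumptions, not the exact MBBR design.

```python
import torch
import torch.nn as nn

class MaskedBoxReconstruction(nn.Module):
    """Mask a fraction of object features and reconstruct them from the
    remaining (unmasked) objects of the scene."""

    def __init__(self, dim=256, mask_ratio=0.3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_ratio = mask_ratio

    def forward(self, obj_feats):                    # (batch, num_objects, dim)
        b, n, d = obj_feats.shape
        mask = torch.rand(b, n, device=obj_feats.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand(b, n, d), obj_feats)
        recon = self.encoder(x)
        # reconstruction loss only on the masked objects
        return ((recon - obj_feats) ** 2)[mask].mean()

feats = torch.randn(2, 12, 256)                      # 12 detected objects per image
print(MaskedBoxReconstruction()(feats))
```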
https://arxiv.org/abs/2311.04834
Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks aiming at localising and recognising relationships between objects, and interactions between humans and objects, respectively. Prevailing works treat these as distinct tasks, leading to the development of task-specific models tailored to individual datasets. However, we posit that the presence of visual relationships can furnish crucial contextual and intricate relational cues that significantly augment the inference of human-object interactions. This motivates us to ask whether there is a natural intrinsic relationship between the two tasks, where scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Concretely, we initiate a relation Transformer tasked with generating relation triples from a suite of visual features. Subsequently, we employ another transformer-based decoder to predict human-object interactions based on the generated relation triples. A comprehensive series of experiments conducted across established benchmark datasets including Visual Genome, V-COCO, and HICO-DET demonstrates the compelling performance of our SG2HOI+ model in comparison to prevalent one-stage SGG models. Remarkably, our approach achieves competitive performance when compared to state-of-the-art HOI methods. Additionally, we observe that our SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner yields substantial improvements for both tasks compared to individualized training paradigms.
https://arxiv.org/abs/2311.01755
Dynamic scene graph generation (SGG) from videos requires not only a comprehensive understanding of objects across scenes that are prone to temporal fluctuations but also modeling of the temporal motions and interactions among different objects. Moreover, the long-tailed distribution of visual relationships is the crucial bottleneck of most dynamic SGG methods, since most of them focus on capturing spatio-temporal context using complex architectures, which leads to the generation of biased scene graphs. To address these challenges, we propose FloCoDe: Flow-aware temporal consistency and Correlation Debiasing with uncertainty attenuation for unbiased dynamic scene graphs. FloCoDe employs feature warping using flow to detect temporally consistent objects across the frames. In addition, it uses correlation debiasing to learn the unbiased relation representation for long-tailed classes. Moreover, to attenuate the predictive uncertainties, it uses a mixture of sigmoidal cross-entropy loss and contrastive loss to incorporate label correlations to identify the commonly co-occurring relations and help debias the long-tailed ones. Extensive experimental evaluation shows a performance gain as high as 4.1%, demonstrating the superiority of generating more unbiased scene graphs.
https://arxiv.org/abs/2310.16073
Although deep neural networks can achieve human-level performance on many object recognition benchmarks, prior work suggests that these same models fail to learn simple abstract relations, such as determining whether two objects are the same or different. Much of this prior work focuses on training convolutional neural networks to classify images of two same or two different abstract shapes, testing generalization on within-distribution stimuli. In this article, we comprehensively study whether deep neural networks can acquire and generalize same-different relations both within and out-of-distribution using a variety of architectures, forms of pretraining, and fine-tuning datasets. We find that certain pretrained transformers can learn a same-different relation that generalizes with near perfect accuracy to out-of-distribution stimuli. Furthermore, we find that fine-tuning on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization. Our results suggest that, with the right approach, deep neural networks can learn generalizable same-different visual relations.
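Fine-tuning a pretrained transformer for same-different classification can be as simple as re-heading it with two output classes and training on images that each contain a pair of shapes. The timm backbone, learning rate, and toy batch below are illustrative assumptions, not the paper's exact setup.

```python
import timm
import torch

# Pretrained ViT re-headed for binary same/different classification
# (backbone choice and hyperparameters are illustrative assumptions).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)      # each image shows two abstract shapes
labels = torch.randint(0, 2, (8,))        # 1 = same, 0 = different

logits = model(images)
loss = loss_fn(logits, labels)
loss.backward()
optim.step()
```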
https://arxiv.org/abs/2310.09612
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. To be specific, a large number of detected objects and optical character recognition (OCR) tokens result in rich visual relationships. Existing works take all visual relationships into account for answer prediction. However, there are three observations: (1) a single subject in the images can be easily detected as multiple objects with distinct bounding boxes (considered repetitive objects). The associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in the image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may be indicative of important visual cues for predicting answers. Rather than utilizing all of them for answer prediction, we make an effort to identify the most important connections or eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU for spatially aware pruning. We consider three visual relationships for graph learning: object-object, OCR-OCR tokens, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations in the correlated object-token sparse graph, and then in the respective object-based sparse graph and token-based sparse graph. Experimental results on TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance. Visualization results further demonstrate the interpretability of our method.
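The DIoU term used for spatially aware pruning can be computed as below: IoU minus the squared centre distance normalised by the squared diagonal of the enclosing box. The edge-keeping threshold is an illustrative assumption, not a value from the paper.

```python
import torch

def diou(boxes_a, boxes_b):
    """Pairwise Distance-IoU between boxes in (x1, y1, x2, y2) format."""
    x1 = torch.max(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = torch.max(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = torch.min(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = torch.min(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    iou = inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

    cxa = (boxes_a[:, 0] + boxes_a[:, 2]) / 2
    cya = (boxes_a[:, 1] + boxes_a[:, 3]) / 2
    cxb = (boxes_b[:, 0] + boxes_b[:, 2]) / 2
    cyb = (boxes_b[:, 1] + boxes_b[:, 3]) / 2
    center_dist = (cxa[:, None] - cxb[None, :]) ** 2 + (cya[:, None] - cyb[None, :]) ** 2

    ex1 = torch.min(boxes_a[:, None, 0], boxes_b[None, :, 0])
    ey1 = torch.min(boxes_a[:, None, 1], boxes_b[None, :, 1])
    ex2 = torch.max(boxes_a[:, None, 2], boxes_b[None, :, 2])
    ey2 = torch.max(boxes_a[:, None, 3], boxes_b[None, :, 3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return iou - center_dist / diag

def prune_edges(boxes, keep_threshold=-0.4):
    """Keep a pair as a graph edge only if its DIoU is high enough
    (the two regions are close); the threshold is an assumption."""
    return diou(boxes, boxes) > keep_threshold
```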
https://arxiv.org/abs/2310.09147
Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models showing their astounding ability to synthesize convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques are susceptible to unintended modifications of regions outside the targeted area, such as on the background or on distractor objects which have some semantic or visual relationship with the targeted object. According to our experimental findings, inaccurate cross-attention maps are at the root of this problem. Based on this observation, we propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt. By updating the dynamic tokens for nouns in the textual input with the proposed leakage repair losses, we achieve fine-grained image editing over particular objects while preventing undesired changes to other image regions. Our method DPL, based on the publicly available Stable Diffusion, is extensively evaluated on a wide range of images, and consistently obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (on user-evaluation). We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
https://arxiv.org/abs/2309.15664
Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step in order to bridge visual and language models. However, current state-of-the-art computer vision models still lack the ability to perform spatial reasoning well. Existing datasets mostly cover a relatively small number of spatial relations, all of which are static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD) -- a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images), consisting of 30 distinct spatial prepositional senses, in the form of object interaction simulations generated synthetically using Unity3D. In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions. To our knowledge, no dataset exists that represents temporal relations through visual settings. In this dataset, we also provide 3D information about object interactions such as frame-wise coordinates and descriptions of the objects used. The goal of this synthetic dataset is to help models perform better in visual relationship detection in real-world settings. We demonstrate an increase in the performance of various models on two real-world datasets (ImageNet-VidVRD and Spatial Senses) when pretrained on the STUPD dataset, in comparison to other pretraining datasets.
https://arxiv.org/abs/2309.06680
Food image classification serves as a fundamental and critical step in image-based dietary assessment, facilitating nutrient intake analysis from captured food images. However, existing work in food classification predominantly focuses on predicting 'food types', which do not contain direct nutritional composition information. This limitation arises from the inherent discrepancies in nutrition databases, which are tasked with associating each 'food item' with its respective information. Therefore, in this work we aim to classify food items to align with a nutrition database. To this end, we first introduce the VFN-nutrient dataset by annotating each food image in VFN with a food item that includes nutritional composition information. Such annotation of food items, being more discriminative than food types, creates a hierarchical structure within the dataset. However, since the food item annotations are solely based on nutritional composition information, they do not always show visual relations with each other, which poses significant challenges when applying deep learning-based techniques for classification. To address this issue, we then propose a multi-stage hierarchical framework for food item classification by iteratively clustering and merging food items during the training process, which allows the deep model to extract image features that are discriminative across labels. Our method is evaluated on the VFN-nutrient dataset and achieves promising results compared with existing work in terms of both food type and food item classification.
https://arxiv.org/abs/2309.01075
Scene Graph Generation (SGG) aims to detect all the visual relation triplets <sub, pred, obj> in a given image. With the emergence of various advanced techniques for better utilizing both the intrinsic and extrinsic information in each relation triplet, SGG has achieved great progress over the recent years. However, due to the ubiquitous long-tailed predicate distributions, today's SGG models are still easily biased to the head predicates. Currently, the most prevalent debiasing solutions for SGG are re-balancing methods, e.g., changing the distributions of original training samples. In this paper, we argue that all existing re-balancing strategies fail to increase the diversity of the relation triplet features of each predicate, which is critical for robust SGG. To this end, we propose a novel Compositional Feature Augmentation (CFA) strategy, which is the first unbiased SGG work to mitigate the bias issue from the perspective of increasing the diversity of triplet features. Specifically, we first decompose each relation triplet feature into two components: intrinsic feature and extrinsic feature, which correspond to the intrinsic characteristics and extrinsic contexts of a relation triplet, respectively. Then, we design two different feature augmentation modules to enrich the feature diversity of original relation triplets by replacing or mixing up either their intrinsic or extrinsic features from other samples. Due to its model-agnostic nature, CFA can be seamlessly incorporated into various SGG frameworks. Extensive ablations have shown that CFA achieves a new state-of-the-art performance on the trade-off between different metrics.
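The extrinsic-feature augmentation can be pictured as replacing or mixing the context part of one triplet's feature with that of another sample while keeping its intrinsic part. This sketch assumes the intrinsic/extrinsic split has already been computed and uses placeholder mixing coefficients; it is a simplification of the paper's augmentation modules.

```python
import torch

def augment_extrinsic(intrinsic, extrinsic, mix_prob=0.5, lam=0.7):
    """Keep each triplet's intrinsic feature; replace or mix its extrinsic
    (context) feature with another sample's to increase feature diversity."""
    perm = torch.randperm(extrinsic.size(0))
    donor = extrinsic[perm]
    replace = torch.rand(extrinsic.size(0), 1) < mix_prob
    mixed = lam * extrinsic + (1 - lam) * donor          # mix-up variant
    new_extrinsic = torch.where(replace, donor, mixed)   # or outright replacement
    return torch.cat([intrinsic, new_extrinsic], dim=-1)

intr = torch.randn(16, 128)   # intrinsic characteristics of 16 relation triplets
extr = torch.randn(16, 128)   # extrinsic context features
print(augment_extrinsic(intr, extr).shape)   # torch.Size([16, 256])
```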
https://arxiv.org/abs/2308.06712
Scene graph generation aims to detect visual relationship triplets, (subject, predicate, object). Due to biases in data, current models tend to predict common predicates, e.g. "on" and "at", instead of informative ones, e.g. "standing on" and "looking at". This tendency results in the loss of precise information and overall performance. If a model only uses "stone on road" rather than "stone blocking road" to describe an image, this may lead to a grave misunderstanding. We argue that this phenomenon is caused by two imbalances: semantic space level imbalance and training sample level imbalance. For this problem, we propose DB-SGG, an effective framework based on debiasing rather than conventional distribution fitting. It integrates two components: Semantic Debiasing (SD) and Balanced Predicate Learning (BPL), for these imbalances. SD utilizes a confusion matrix and a bipartite graph to construct predicate relationships. BPL adopts a random undersampling strategy and an ambiguity-removing strategy to focus on informative predicates. Benefiting from this model-agnostic design, our method can be easily applied to SGG models and outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 across three SGG sub-tasks on the SGG-VG dataset. Our method is further verified on another complex SGG dataset (SGG-GQA) and two downstream tasks (sentence-to-graph retrieval and image captioning).
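The random undersampling step in Balanced Predicate Learning amounts to capping the number of training triplets per predicate so that head predicates such as "on" no longer dominate. The cap value and sampling policy in this sketch are assumptions, not the paper's settings.

```python
import random
from collections import defaultdict

def undersample_predicates(triplets, max_per_predicate=500, seed=0):
    """Randomly cap the number of training triplets per predicate class."""
    rng = random.Random(seed)
    by_pred = defaultdict(list)
    for t in triplets:                        # t = (subject, predicate, object, ...)
        by_pred[t[1]].append(t)
    balanced = []
    for pred, samples in by_pred.items():
        rng.shuffle(samples)
        balanced.extend(samples[:max_per_predicate])
    rng.shuffle(balanced)
    return balanced
```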
https://arxiv.org/abs/2308.05286
Vague objectives in many real-life scenarios pose long-standing challenges for robotics, as defining rules, rewards, or constraints for optimization is difficult. Tasks like tidying a messy table may appear simple for humans, but articulating the criteria for tidiness is complex due to the ambiguity and flexibility in commonsense reasoning. Recent advancement in Large Language Models (LLMs) offers us an opportunity to reason over these vague objectives: learned from extensive human data, LLMs capture meaningful common sense about human behavior. However, as LLMs are trained solely on language input, they may struggle with robotic tasks due to their limited capacity to account for perception and low-level controls. In this work, we propose a simple approach to solve the task of table tidying, an example of robotic tasks with vague objectives. Specifically, the task of tidying a table involves not just clustering objects by type and functionality for semantic tidiness but also considering spatial-visual relations of objects for a visually pleasing arrangement, termed as visual tidiness. We propose to learn a lightweight, image-based tidiness score function to ground the semantically tidy policy of LLMs to achieve visual tidiness. We innovatively train the tidiness score using synthetic data gathered using random walks from a few tidy configurations. Such trajectories naturally encode the order of tidiness, thereby eliminating the need for laborious and expensive human demonstrations. Our empirical results show that our pipeline can be applied to unseen objects and complex 3D arrangements.
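The random-walk supervision for the tidiness score can be sketched as follows: perturb a tidy layout step by step, and let the step index provide the ordering target, so no human demonstrations are needed. The Gaussian perturbation model and the linear score target are illustrative assumptions.

```python
import random

def random_walk_pairs(tidy_layout, steps=10, sigma=0.05):
    """From a tidy layout (list of (x, y) object positions), take a random
    walk of small perturbations; earlier states are tidier, so the step index
    gives a free ordering signal for training a tidiness score function."""
    layout, trajectory = list(tidy_layout), []
    for step in range(steps):
        layout = [(x + random.gauss(0, sigma), y + random.gauss(0, sigma))
                  for x, y in layout]
        # target decreases with distance from the tidy start (illustrative choice)
        trajectory.append((list(layout), 1.0 - step / steps))
    return trajectory

pairs = random_walk_pairs([(0.1, 0.1), (0.2, 0.1), (0.3, 0.1)])
print(pairs[0][1], pairs[-1][1])   # tidiness targets shrink along the walk
```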
https://arxiv.org/abs/2307.11319
Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
https://arxiv.org/abs/2307.08984