Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information. Our approach utilizes entity clustering to address the complexity of relation triplet categories, enabling the effective integration of subject-object information. Additionally, we use a large language model (LLM) to generate detailed region-aware prompts, capturing fine-grained visual interactions and improving alignment between visual and textual modalities. RAHP also introduces a dynamic selection mechanism within Vision-Language Models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets show that our framework consistently achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of open-vocabulary scene graph generation.
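As a rough illustration of the dynamic prompt-selection idea described above, the sketch below scores a pool of relation prompts against a visual embedding and keeps only the most relevant ones; the function name, the top-k rule, and the use of CLIP-style embeddings are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def select_relevant_prompts(visual_feat, prompt_embeds, top_k=5):
    """Hypothetical sketch of dynamic prompt selection: keep only the text
    prompts most similar to the visual content, so irrelevant prompts add
    no noise to the relation score."""
    # visual_feat: (d,), prompt_embeds: (num_prompts, d), CLIP-style embeddings
    v = F.normalize(visual_feat, dim=-1)
    t = F.normalize(prompt_embeds, dim=-1)
    sims = t @ v                                     # cosine similarity per prompt
    top_sims, top_idx = sims.topk(min(top_k, t.shape[0]))
    weights = top_sims.softmax(dim=0)                # relevance-weighted aggregation
    relation_score = (weights * top_sims).sum()
    return relation_score, top_idx

# toy usage with random embeddings
score, kept = select_relevant_prompts(torch.randn(512), torch.randn(40, 512))
```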
https://arxiv.org/abs/2412.19021
In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel \ul{CC}Net, comprising two core modules: the Cross-Modal Consistency \ul{C}ollaboration (CMCC) and the Multi-Temporal Granularity \ul{C}ollaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former branch facilitates the aggregation of consistent event semantics across modalities through the encoding of audio-visual relations, while the latter branch guides one modality's focus to pivotal event-relevant temporal areas as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support among coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization. The code is available at \url{this https URL}.
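The temporal consistency-gated branch can be pictured as one modality predicting per-frame event relevance and gating the other modality's features with it. The module below is a minimal sketch under that reading; the layer choices and names are assumptions, not the CCNet code.

```python
import torch
import torch.nn as nn

class ConsistencyGate(nn.Module):
    """Illustrative sketch (not the authors' code) of a temporal
    consistency-gated branch: one modality predicts per-frame event
    relevance and uses it to gate the other modality's features."""
    def __init__(self, dim):
        super().__init__()
        self.relevance = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, guide_feats, target_feats):
        # guide_feats, target_feats: (batch, time, dim)
        gate = self.relevance(guide_feats)   # (batch, time, 1), values in [0, 1]
        return target_feats * gate           # emphasize event-relevant frames

audio = torch.randn(2, 100, 256)   # toy untrimmed-video features
video = torch.randn(2, 100, 256)
gated_video = ConsistencyGate(256)(audio, video)
```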
https://arxiv.org/abs/2412.12628
Spatio-Temporal Scene Graphs (STSGs) provide a concise and expressive representation of dynamic scenes by modelling objects and their evolving relationships over time. However, real-world visual relationships often exhibit a long-tailed distribution, causing existing methods for tasks like Video Scene Graph Generation (VidSGG) and Scene Graph Anticipation (SGA) to produce biased scene graphs. To this end, we propose ImparTail, a novel training framework that leverages curriculum learning and loss masking to mitigate bias in the generation and anticipation of spatio-temporal scene graphs. Our approach gradually decreases the dominance of the head relationship classes during training and focuses more on tail classes, leading to more balanced training. Furthermore, we introduce two new tasks, Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation, designed to evaluate the robustness of STSG models against distribution shifts. Extensive experiments on the Action Genome dataset demonstrate that our framework significantly enhances the unbiased performance and robustness of STSG models compared to existing methods.
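The curriculum-plus-loss-masking idea can be sketched as a per-sample mask whose head-class drop probability grows over training. The schedule and drop rule below are illustrative assumptions, not ImparTail's exact recipe.

```python
import torch

def curriculum_loss_mask(labels, class_freq, epoch, total_epochs):
    """Hypothetical sketch of curriculum-style loss masking: head-class
    samples are dropped from the loss with a probability that grows as
    training progresses, shifting focus toward tail relations."""
    progress = epoch / max(total_epochs - 1, 1)               # 0 at start, 1 at end
    rel_freq = class_freq / class_freq.sum()                  # (C,)
    drop_prob = progress * rel_freq[labels] / rel_freq.max()  # frequent classes dropped more
    mask = (torch.rand_like(drop_prob) >= drop_prob).float()  # 1 = keep this sample's loss
    return mask

# usage: per-sample losses (criterion with reduction='none') are multiplied
# by the mask before averaging
labels = torch.randint(0, 25, (64,))
mask = curriculum_loss_mask(labels, torch.rand(25) * 1000, epoch=5, total_epochs=10)
```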
https://arxiv.org/abs/2411.13059
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline -- computing similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic as they remain unchanged across contexts. Using such fixed text classifiers not only struggles to model visual relations with high variance, but also falls short in adapting to distinct contexts. To remedy these intrinsic shortcomings, we devise SDSGG, a scene-specific description based OVSGG framework where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM is asked to play different roles (e.g., biologist and engineer) to analyze and discuss the descriptive features of a given scene from different views. Unlike previous efforts simply treating the generated descriptions as mutually equivalent text classifiers, SDSGG is equipped with an advanced renormalization mechanism to adjust the influence of each text classifier based on its relevance to the presented scene (this is what the term "specific" means). Furthermore, to capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter. It refines CLIP's ability to recognize relations by learning an interaction-aware semantic space. Extensive experiments on prevalent benchmarks show that SDSGG outperforms top-leading methods by a clear margin.
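A minimal sketch of the renormalization step, assuming a small learned relevance head: each scene-specific description is scored against the image, and its contribution is reweighted by a softmax over relevance logits. Names and shapes are illustrative, not SDSGG's implementation.

```python
import torch
import torch.nn.functional as F

def scene_specific_scores(image_feat, desc_embeds, relevance_head):
    """Minimal sketch (with an assumed relevance head) of renormalization:
    each LLM-generated description acts as a text classifier whose weight
    depends on its relevance to the presented scene."""
    img = F.normalize(image_feat, dim=-1)              # (d,)
    txt = F.normalize(desc_embeds, dim=-1)             # (num_desc, d)
    per_desc_score = txt @ img                         # similarity of each description
    relevance = relevance_head(desc_embeds * image_feat)   # (num_desc, 1) relevance logits
    weights = relevance.squeeze(-1).softmax(dim=0)     # renormalized influence
    return (weights * per_desc_score).sum()

relevance_head = torch.nn.Linear(512, 1)
score = scene_specific_scores(torch.randn(512), torch.randn(12, 512), relevance_head)
```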
https://arxiv.org/abs/2410.15364
Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in actions that we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art performance on several visual reasoning tasks while being more sample-efficient and generalizing better to out-of-distribution visual inputs than prior models.
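The glimpsing loop can be sketched as repeated selection of the most salient location, cropping a high-resolution patch there, and recording the fixation coordinates as the low-dimensional spatial signal. The inhibition-of-return rule and patch size below are assumptions for illustration.

```python
import numpy as np

def take_glimpses(image, saliency, num_glimpses=4, patch=32):
    """Illustrative sketch of glimpse-based active perception: repeatedly
    fixate the most salient location, crop a high-resolution patch, and
    keep the (x, y) location so relations can be encoded in location space."""
    sal = saliency.copy()
    glimpses, locations = [], []
    h, w = image.shape[:2]
    for _ in range(num_glimpses):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)   # next fixation point
        y0 = np.clip(y - patch // 2, 0, h - patch)
        x0 = np.clip(x - patch // 2, 0, w - patch)
        glimpses.append(image[y0:y0 + patch, x0:x0 + patch])
        locations.append((int(x), int(y)))                   # low-dimensional spatial signal
        sal[max(0, y - patch):y + patch, max(0, x - patch):x + patch] = -np.inf  # inhibit return
    return glimpses, locations

img = np.random.rand(128, 128, 3)
glimpses, locs = take_glimpses(img, np.random.rand(128, 128))
```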
https://arxiv.org/abs/2409.20213
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based Transformer decoder, where the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and accordingly, an auxiliary relationship loss is designed to enable the decoder to perceive the relationships between objects explicitly. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP well to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for visual representation and vision-guided language prompting for language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. Our framework is also applied to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
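One way to picture the relationship query is as an extra learned query appended to the object queries of a DETR-style decoder, with its output supervised by the auxiliary relation loss. The sketch below follows that reading; layer sizes and names are assumptions, not the authors' detector.

```python
import torch
import torch.nn as nn

class RelationAwareDecoder(nn.Module):
    """Rough sketch (names and sizes are assumptions) of embedding a
    relationship query alongside object queries in a query-based decoder,
    so an auxiliary relation loss can supervise the extra query's output."""
    def __init__(self, dim=256, num_obj_queries=100, num_relations=132):
        super().__init__()
        self.obj_queries = nn.Parameter(torch.randn(num_obj_queries, dim))
        self.rel_query = nn.Parameter(torch.randn(1, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.rel_head = nn.Linear(dim, num_relations)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_tokens, dim) from a CLIP-distilled image encoder
        b = frame_feats.shape[0]
        queries = torch.cat([self.obj_queries, self.rel_query], dim=0)
        queries = queries.unsqueeze(0).repeat(b, 1, 1)
        out = self.decoder(queries, frame_feats)
        obj_out, rel_out = out[:, :-1], out[:, -1]
        return obj_out, self.rel_head(rel_out)   # relation logits for the auxiliary loss

obj_feats, rel_logits = RelationAwareDecoder()(torch.randn(2, 196, 256))
```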
https://arxiv.org/abs/2409.12499
Interacting with real-world cluttered scenes poses several challenges to robotic agents that need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically manage simplified scenarios and focus on predicting pairwise object relationships following an initial object detection phase, but often overlook the global context or struggle with handling redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at this https URL.
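Producing an adjacency matrix over detected objects can be sketched as scoring every ordered pair of object embeddings with a small pairwise head. The module below illustrates that idea only; it is not the released D3G model.

```python
import torch
import torch.nn as nn

class DependencyGraphHead(nn.Module):
    """Simplified sketch (not the released D3G code) of predicting an
    adjacency matrix over detected objects: every ordered pair of object
    embeddings is scored for a spatial dependency."""
    def __init__(self, dim=256):
        super().__init__()
        self.pair_scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, obj_embeds):
        # obj_embeds: (num_objects, dim) from the detection transformer
        n, d = obj_embeds.shape
        src = obj_embeds.unsqueeze(1).expand(n, n, d)   # row i = subject
        dst = obj_embeds.unsqueeze(0).expand(n, n, d)   # column j = object
        logits = self.pair_scorer(torch.cat([src, dst], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)                    # (n, n) adjacency probabilities

adjacency = DependencyGraphHead()(torch.randn(35, 256))  # up to 35 objects per D3GD scene
```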
https://arxiv.org/abs/2409.02035
To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96\% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can run inference quickly even when a single engineering drawing contains numerous objects.
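The three-part layout named in the abstract (vision encoder, object encoder, relation decoder) can be sketched schematically in PyTorch as below; every layer choice and the per-object association head are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViREDSketch(nn.Module):
    """Schematic sketch of the three-part layout described in the abstract
    (vision encoder, object encoder, relation decoder); the layer choices
    are placeholders, not the authors' implementation."""
    def __init__(self, dim=256):
        super().__init__()
        self.vision_encoder = nn.Sequential(nn.Conv2d(3, dim, 16, stride=16), nn.Flatten(2))
        self.object_encoder = nn.Linear(4, dim)          # encode table/circuit bounding boxes
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.relation_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.rel_head = nn.Linear(dim, 1)                # placeholder association score head

    def forward(self, drawing, boxes):
        # drawing: (b, 3, H, W) rasterized drawing; boxes: (b, n, 4) object boxes
        img_tokens = self.vision_encoder(drawing).transpose(1, 2)   # (b, tokens, dim)
        obj_tokens = self.object_encoder(boxes)                     # (b, n, dim)
        decoded = self.relation_decoder(obj_tokens, img_tokens)
        return torch.sigmoid(self.rel_head(decoded)).squeeze(-1)    # per-object association

scores = ViREDSketch()(torch.randn(1, 3, 512, 512), torch.rand(1, 10, 4))
```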
https://arxiv.org/abs/2409.00909
Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64\% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96\% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the model size by nearly 80\%, highlighting its efficiency in terms of model complexity. Our work demonstrates that the Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
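A compact attention-based audio-visual fusion block might look like the sketch below, where projected video features attend to audio features before a joint classifier; the dimensions, the single cross-attention layer, and the class count are assumptions rather than the reported 72M-parameter configuration.

```python
import torch
import torch.nn as nn

class AttendFusionSketch(nn.Module):
    """Minimal sketch of compact attention-based audio-visual fusion for
    video-level classification; sizes and layers are assumptions."""
    def __init__(self, video_dim=1024, audio_dim=128, dim=512, num_classes=3862):
        super().__init__()
        self.v_proj = nn.Linear(video_dim, dim)
        self.a_proj = nn.Linear(audio_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, video_feats, audio_feats):
        # video_feats: (b, Tv, video_dim), audio_feats: (b, Ta, audio_dim) frame features
        v, a = self.v_proj(video_feats), self.a_proj(audio_feats)
        v_attended, _ = self.cross_attn(v, a, a)          # video attends to audio
        fused = torch.cat([v_attended.mean(1), a.mean(1)], dim=-1)
        return self.classifier(fused)                     # multi-label logits

logits = AttendFusionSketch()(torch.randn(2, 300, 1024), torch.randn(2, 300, 128))
```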
https://arxiv.org/abs/2408.14441
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performances on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales. The code is available at \url{this https URL}.
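Casting predicate detection as 1D instance segmentation can be sketched as combining the subject and object feature sequences per frame and emitting a per-frame mask plus a relation class for the pair. The layers below are illustrative assumptions, not VrdONE's architecture.

```python
import torch
import torch.nn as nn

class PredicateMask1D(nn.Module):
    """Rough sketch of casting predicate detection as 1D segmentation:
    subject and object feature sequences are combined per frame, and the
    model emits a per-frame mask plus a relation class for the pair."""
    def __init__(self, dim=256, num_predicates=50):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.mask_head = nn.Linear(dim, 1)            # per-frame "relation active" logit
        self.cls_head = nn.Linear(dim, num_predicates)

    def forward(self, subj_feats, obj_feats):
        # subj_feats, obj_feats: (b, T, dim) trajectories of one entity pair
        x = self.combine(torch.cat([subj_feats, obj_feats], dim=-1))
        x = self.temporal(x.transpose(1, 2)).transpose(1, 2).relu()
        mask = torch.sigmoid(self.mask_head(x)).squeeze(-1)   # (b, T) temporal extent
        logits = self.cls_head(x.mean(dim=1))                 # (b, num_predicates)
        return mask, logits

mask, logits = PredicateMask1D()(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
```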
https://arxiv.org/abs/2408.09408
Visual relationship understanding has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks. Given the complexity and interconnectedness of these tasks, it is crucial to have a flexible framework that can effectively address these tasks in a cohesive manner. In this work, we propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation, and further possesses the capability for open-vocabulary segmentation to adapt to novel scenarios. FleVRS leverages the synergy between text and image modalities to ground various types of relationships from images, and uses textual features from vision-language models for visual conceptual understanding. Empirical validation across various datasets demonstrates that our framework outperforms existing models in standard, promptable, and open-vocabulary tasks, e.g., +1.9 $mAP$ on HICO-DET, +11.4 $Acc$ on VRD, and +4.7 $mAP$ on unseen HICO-DET. Our FleVRS represents a significant step towards a more intuitive, comprehensive, and scalable understanding of visual relationships.
https://arxiv.org/abs/2408.08305
Recent Audio-Visual Question Answering (AVQA) methods rely on complete visual and audio input to answer questions accurately. However, in real-world scenarios, issues such as device malfunctions and data transmission errors frequently result in missing audio or visual modality. In such cases, existing AVQA methods suffer significant performance degradation. In this paper, we propose a framework that ensures robust AVQA performance even when a modality is missing. First, we propose a Relation-aware Missing Modal (RMM) generator with Relation-aware Missing Modal Recalling (RMMR) loss to enhance the ability of the generator to recall missing modal information by understanding the relationships and context among the available modalities. Second, we design an Audio-Visual Relation-aware (AVR) diffusion model with Audio-Visual Enhancing (AVE) loss to further enhance audio-visual features by leveraging the relationships and shared cues between the audio-visual modalities. As a result, our method can provide accurate answers by effectively utilizing available information even when input modalities are missing. We believe our method holds potential applications not only in AVQA research but also in various multi-modal scenarios.
https://arxiv.org/abs/2407.16171
Text-based person search (TBPS) is a problem that has gained significant interest within the research community. The task is to retrieve one or more images of a specific individual based on a textual description. The multi-modal nature of the task requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. The first is inter-identity noise, which stems from the inherent vagueness and imprecision of text descriptions: descriptions of visual attributes can generally apply to many different people. The second is intra-identity variation: nuisances such as pose and illumination can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description. In doing so, the model is encouraged to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text. This loss ensures that every attribute is taken into consideration in the person retrieval process. Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric w.r.t. the current state of the art.
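The Attribute Loss can be pictured as giving each adjective-noun chunk its own alignment term and averaging the terms so no single attribute dominates. The sketch below is a heavily simplified reading; the similarity target and the absence of negatives are assumptions.

```python
import torch
import torch.nn.functional as F

def attribute_balanced_loss(image_emb, attribute_embs):
    """Heavily simplified sketch of an attribute-level loss: each
    adjective-noun chunk of the caption (e.g. 'red backpack') gets its own
    alignment term, and the terms are averaged so no attribute dominates.
    The exact objective here is an illustrative assumption."""
    img = F.normalize(image_emb, dim=-1)                 # (d,)
    attrs = F.normalize(attribute_embs, dim=-1)          # (num_attributes, d)
    sims = attrs @ img                                   # cosine similarity per attribute
    per_attribute = 1.0 - sims                           # push each attribute toward the image
    return per_attribute.mean()

loss = attribute_balanced_loss(torch.randn(512), torch.randn(5, 512))
```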
https://arxiv.org/abs/2407.04287
The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which is essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset's long-tail distribution significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information.
https://arxiv.org/abs/2406.16449
Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This raises the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive biases to do so: 1) a perceptual stage wherein local object features are extracted and stored in a disentangled representation, and 2) a relational stage wherein object representations are compared. In the second stage, we find evidence that ViTs can learn to represent somewhat abstract visual relations, a capability that has long been considered out of reach for artificial neural networks. Finally, we demonstrate that failure points at either stage can prevent a model from learning a generalizable solution to our fairly simple tasks. By understanding ViTs in terms of discrete processing stages, one can more precisely diagnose and rectify shortcomings of existing and future models.
https://arxiv.org/abs/2406.15955
Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines, which typically consist of object detection, temporal association, and multi-relation classification. However, these methods exhibit inherent limitations due to the separation of multiple stages, and independent optimization of these sub-problems may yield sub-optimal solutions. To remedy these limitations, we propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline. This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph. Moreover, to address another challenge of DSGG, capturing temporal dependencies, we introduce a Progressively Refined Module (PRM) for aggregating temporal context without the constraints of additional trackers or handcrafted trajectories, enabling end-to-end optimization of the network. Extensive experiments conducted on the Action Genome benchmark demonstrate the effectiveness of our design. The code and models are available at \url{this https URL}.
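The set-prediction formulation implies a one-to-one matching between predicted subject-object pairs and ground-truth pairs, typically via a Hungarian assignment over a combined cost. The sketch below illustrates that step with assumed cost weights; it is not the OED training code.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_pairs(pred_pair_logits, pred_boxes, gt_labels, gt_boxes):
    """Illustrative sketch of the set-prediction view: predicted
    subject-object pairs are matched one-to-one to ground-truth pairs by a
    Hungarian assignment over a classification + box cost (weights assumed)."""
    # pred_pair_logits: (Q, C), pred_boxes: (Q, 8) concatenated subject+object boxes
    # gt_labels: (G,), gt_boxes: (G, 8)
    cls_cost = -pred_pair_logits.softmax(-1)[:, gt_labels]          # (Q, G)
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)               # (Q, G) L1 distance
    cost = (cls_cost + 5.0 * box_cost).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx   # matched pairs used to compute the training loss

p_idx, g_idx = match_pairs(torch.randn(100, 26), torch.rand(100, 8),
                           torch.randint(0, 26, (4,)), torch.rand(4, 8))
```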
https://arxiv.org/abs/2405.16925
Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can be often recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
https://arxiv.org/abs/2404.04565
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences. By doing so, we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.
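The image-to-graph paradigm ends with parsing the generated sequence back into nodes and edges. The toy parser below assumes a "(subject, predicate, object)" output format, which is an illustrative choice rather than the paper's actual sequence schema.

```python
import re

def sequence_to_scene_graph(generated_text):
    """Toy sketch of the image-to-graph idea: the VLM emits a textual
    sequence of triplets, which is then parsed into nodes and edges.
    The '(subject, predicate, object)' format here is an assumption."""
    triplets = re.findall(r"\(([^,]+),([^,]+),([^)]+)\)", generated_text)
    nodes, edges = set(), []
    for subj, pred, obj in triplets:
        subj, pred, obj = subj.strip(), pred.strip(), obj.strip()
        nodes.update([subj, obj])
        edges.append((subj, pred, obj))
    return {"nodes": sorted(nodes), "edges": edges}

graph = sequence_to_scene_graph("(man, riding, horse) (horse, on, beach)")
# -> nodes: ['beach', 'horse', 'man']; edges: [('man', 'riding', 'horse'), ('horse', 'on', 'beach')]
```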
https://arxiv.org/abs/2404.00906
Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often struggle with accurately depicting visual relationships between objects. This issue, as we uncover through careful analysis, arises from a misaligned text encoder that struggles to interpret specific relationships and differentiate the logical order of associated objects. To resolve this, we introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this, we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). It models the directional relationships between relation terms and corresponding objects within the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space. Crucially, our method retains the parameters of the text encoder and diffusion model, preserving the model's robust performance on unrelated descriptions. We validated our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative enhancements in generating images with precise visual relations. Project page: this https URL.
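The rectification step can be pictured as a lightweight module that adds a learned correction to the frozen text encoder's embeddings so that a prompt and its object-reversed counterpart are pushed apart. The sketch below uses a plain MLP purely to illustrate that training signal; the paper itself uses a heterogeneous graph convolutional network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAdjuster(nn.Module):
    """Very rough stand-in for the lightweight adjustment step: a small
    network nudges the frozen text encoder's output so that a prompt and
    its object-reversed counterpart are separated in embedding space.
    (A plain MLP here, only to illustrate the training signal.)"""
    def __init__(self, dim=768):
        super().__init__()
        self.delta = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text_emb):
        return text_emb + self.delta(text_emb)   # frozen encoder output + learned correction

def direction_loss(adjusted_fwd, adjusted_rev, margin=0.5):
    # encourage "A on B" and "B on A" to occupy distinct embeddings
    cos = F.cosine_similarity(adjusted_fwd, adjusted_rev, dim=-1)
    return F.relu(cos - margin).mean()

adjuster = EmbeddingAdjuster()
loss = direction_loss(adjuster(torch.randn(4, 768)), adjuster(torch.randn(4, 768)))
```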
https://arxiv.org/abs/2403.20249
Visual Relationship Detection (VRD) has seen significant advancements with Transformer-based architectures recently. However, we identify two key limitations in a conventional label assignment for training Transformer-based VRD models, which is a process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment, an unspecialized query is trained since a query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Furthermore, a query is also insufficiently trained since a GT is assigned only to a single prediction, therefore near-correct or even correct predictions are suppressed by being assigned no relation as a GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a specialized query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates the training by assigning a GT to multiple predictions that are significantly close to a GT in terms of a subject, an object, and the relation in between. Experimental results and analyses show that SpeaQ effectively trains specialized queries, which better utilize the capacity of a model, resulting in consistent performance gains with zero additional inference cost across multiple VRD models and benchmarks. Code is available at this https URL.
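The two ideas in SpeaQ can be sketched together: restrict each ground-truth relation to queries in its own group, then assign it not only to the lowest-cost query but also to other sufficiently high-quality predictions. The grouping scheme, quality threshold, and cap below are assumptions for illustration.

```python
import torch

def quality_aware_multi_assign(costs, quality, gt_group, query_group, max_extra=3, tau=0.7):
    """Conceptual sketch (thresholds and grouping are assumptions):
    (1) a GT may only be assigned to queries in its relation group, and
    (2) besides the single best query, other sufficiently high-quality
    predictions also receive the GT as supervision."""
    # costs: (Q, G) matching cost, quality: (Q, G) in [0, 1]
    # query_group: (Q,) group id per query, gt_group: (G,) group id per GT
    allowed = query_group.unsqueeze(1) == gt_group.unsqueeze(0)      # groupwise specialization
    masked_cost = costs.masked_fill(~allowed, float("inf"))
    assignments = []
    for g in range(costs.shape[1]):
        best = masked_cost[:, g].argmin()
        extra = ((quality[:, g] > tau) & allowed[:, g]).nonzero(as_tuple=True)[0]
        chosen = torch.unique(torch.cat([best.view(1), extra[:max_extra]]))
        assignments.append(chosen)                                    # queries supervised by GT g
    return assignments

a = quality_aware_multi_assign(torch.rand(100, 5), torch.rand(100, 5),
                               torch.randint(0, 4, (5,)), torch.randint(0, 4, (100,)))
```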
https://arxiv.org/abs/2403.17709