Despite tremendous improvements in tasks such as image classification, object detection, and segmentation, the recognition of visual relationships, commonly modeled as the extraction of a graph from an image, remains a challenging task. We believe that this mainly stems from the fact that there is no canonical way to approach the visual graph recognition task. Most existing solutions are specific to a problem and cannot be transferred between different contexts out of the box, even though the conceptual problem remains the same. With broad applicability and simplicity in mind, in this paper we develop a method, \textbf{Gra}ph Recognition via \textbf{S}ubgraph \textbf{P}rediction (\textbf{GraSP}), for recognizing graphs in images. We show across several synthetic benchmarks and one real-world application that our method works with a diverse set of graph types and drawings, and can be transferred between tasks without task-specific modifications, paving the way to a more unified framework for visual graph recognition.
https://arxiv.org/abs/2601.15133
Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.
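The headline result here is Precision@10 over a discovered slice. As a minimal sketch (function and argument names are ours, not the paper's), the metric can be computed as:

```python
def precision_at_k(ranked_item_ids, true_failure_ids, k=10):
    """Fraction of the top-k items in a discovered slice that are genuine
    model failures (Precision@K). `ranked_item_ids` is the slice ranked by
    confidence; `true_failure_ids` is the ground-truth failure set."""
    top_k = ranked_item_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in true_failure_ids)
    return hits / len(top_k)
```

Under this reading, the reported 0.73 vs. 0.31 means that 7.3 of the top 10 items in an average SliceLens slice are true failures, versus roughly 3 for the prior best.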
https://arxiv.org/abs/2512.24592
Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at this https URL.
https://arxiv.org/abs/2511.14159
Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, thus leading to suboptimal performance. Toward solving this problem, we propose a method that improves multimodal alignment and fusion based on both semantics and relations. Specifically, we first extract multilevel semantic features from different vision encoders to capture more visual cues of the relationships. Then, we learn to project the vision features into groups of related semantics, which are more likely to hold relationships. Finally, we fuse the visual features with the textual ones using inheritable cross-attention, where we globally remove redundant visual relationships by discarding visual-language feature pairs with low correlation. We evaluate our proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
https://arxiv.org/abs/2511.08238
Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.
https://arxiv.org/abs/2510.12750
Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods only model relationships for pairwise objects, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Given the absence of specific annotations for referred objects in the training data, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state-of-the-art, and prove the advantages of n-ary relational perception in 3D localization.
https://arxiv.org/abs/2510.10194
Scene Graph Generation (SGG) encodes visual relationships between objects in images as graph structures. Thanks to the advances of Vision-Language Models (VLMs), the task of Open-Vocabulary SGG has been recently proposed, where models are evaluated on their ability to learn a wide and diverse range of relations. Current benchmarks in SGG, however, possess a very limited vocabulary, making the evaluation of open-source models inefficient. In this paper, we propose a new reference-free metric to fairly evaluate the open-vocabulary capabilities of VLMs for relation prediction. Another limitation of Open-Vocabulary SGG is the reliance on weakly supervised data of poor quality for pre-training. We also propose a new solution for quickly generating high-quality synthetic data through region-specific prompt tuning of VLMs. Experimental results show that pre-training with this new data split can benefit the generalization capabilities of Open-Voc SGG models.
https://arxiv.org/abs/2509.01209
In this paper, we propose a new instance-level human-object interaction detection task on videos called ST-HOID, which aims to distinguish fine-grained human-object interactions (HOIs) and the trajectories of subjects and objects. It is motivated by the fact that HOI is crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct the first dataset named VidOR-HOID for ST-HOID evaluation, which contains 10,831 spatial-temporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The experimental results demonstrate that our method outperforms the baselines generated by the state-of-the-art methods of image human-object interaction detection, video visual relation detection and video human-object interaction recognition.
https://arxiv.org/abs/2508.17270
Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.
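The conversion step ART describes — turning a VRD annotation with given subject-object boxes into an instruction-tuning example — might look like the following sketch. The prompt wording and field names are our illustrative assumptions, not the paper's actual format:

```python
def vrd_to_instruction(subject, subject_box, obj, object_box, predicate):
    """Turn one VRD annotation into an instruction-tuning example for
    relation classification: the subject-object boxes are given in the
    prompt, and the predicate is the target response."""
    instruction = (
        f"Given the {subject} at {subject_box} and the {obj} at "
        f"{object_box}, what is the relationship between them?"
    )
    return {"instruction": instruction, "response": predicate}

# Hypothetical annotation: a person riding a horse.
example = vrd_to_instruction(
    "person", (12, 30, 88, 200), "horse", (40, 90, 260, 210), "riding"
)
```

An adaptive sampler would then select which of these examples are informative enough to tune on.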
https://arxiv.org/abs/2507.23543
Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
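The knowledge-injection step above minimizes the feature discrepancy between learnable soft prompts and LLM-generated hard prompts. A minimal sketch of such a discrepancy loss, assuming a simple mean-squared formulation over per-class feature vectors (the paper's exact objective may differ):

```python
def prompt_alignment_loss(soft_feats, hard_feats):
    """Mean squared discrepancy between soft-prompt features and the
    (frozen) text features of LLM-generated hard prompts, one vector per
    expression class. Minimizing this injects the LLM's external
    knowledge into the soft prompts."""
    assert len(soft_feats) == len(hard_feats)
    total, dims = 0.0, 0
    for s_vec, h_vec in zip(soft_feats, hard_feats):
        total += sum((s - h) ** 2 for s, h in zip(s_vec, h_vec))
        dims += len(s_vec)
    return total / dims
```

The loss is zero exactly when the soft prompts reproduce the hard-prompt features, so gradient descent on it pulls the learnable prompts toward the LLM descriptions.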
https://arxiv.org/abs/2506.21017
Reasoning over visual relationships-spatial, functional, interactional, social, etc.-is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generation remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships, capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset built by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
https://arxiv.org/abs/2506.07643
Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet, most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible, but unannotated, relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data and enables generalization to unseen predicates. Additionally, we introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 on predicate classification on these three sets. These results highlight the promise of grounded LLM priors for scalable open-world visual understanding.
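The EM-style alternation can be sketched as a loop in which the LLM proposes candidate triplets (expectation) and the visual model keeps only those it can ground (maximization). The stubs below are stand-ins for the paper's LLM prior and visual grounder, so this is a shape sketch rather than the actual method:

```python
def em_relation_grounding(objects, propose_relations, visual_score,
                          rounds=3, threshold=0.5):
    """EM-style alternation: `propose_relations` (the LLM prior) suggests
    candidate (subject, predicate, object) triplets from detected objects,
    conditioned on what is already grounded; `visual_score` (the visual
    model) keeps only hypotheses grounded above a threshold."""
    grounded = set()
    for _ in range(rounds):
        candidates = propose_relations(objects, grounded)  # expectation
        for triplet in candidates:                         # maximization
            if visual_score(triplet) >= threshold:
                grounded.add(triplet)
    return grounded
```

In the paper the maximization step trains the visual model on the accepted hypotheses rather than merely filtering, which is what bootstraps understanding beyond the annotated predicates.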
https://arxiv.org/abs/2506.05651
Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract and transfer content-aware editing intent to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT) based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.
https://arxiv.org/abs/2506.02528
We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
https://arxiv.org/abs/2505.20106
Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternatively enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance.
https://arxiv.org/abs/2505.06663
Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but remain limited in visual relation understanding (\eg, scene graph generation), particularly in modeling \textit{N}-ary relationships that identify multiple semantic roles within an action event. Such a lack of \textit{semantic dependency} modeling among multiple entities leads to unreliable outputs, intensifying MLLMs' hallucinations and over-reliance on language priors. To this end, we propose Relation-R1, the first unified relational comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-reward optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Extensive experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and \textit{N}-ary relation understanding.
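GRPO's core step normalizes each sampled output's reward against its own group rather than a learned value baseline. A minimal sketch of that group-relative advantage computation (the surrounding policy-gradient machinery, clipping, and the multi-reward design are omitted):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Group Relative Policy Optimization: for a group of outputs sampled
    from the same prompt, each output's advantage is its reward normalized
    by the group's mean and (population) standard deviation:
    A_i = (r_i - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are zero-mean within each group, outputs are only pushed up or down relative to their siblings, which is what makes the method critic-free.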
https://arxiv.org/abs/2504.14642
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., ``ride'' can be depicted as ``race'' and ``sit on'', from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
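The matching stage described above assigns the generated relation embeddings to subject-object pairs by semantic similarity. A minimal sketch, assuming cosine similarity and greedy per-relation assignment (the paper's matching may be more elaborate):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_relations(pair_embeds, relation_embeds):
    """Assign each generated relation embedding to the subject-object pair
    whose (visual + text) embedding is most semantically similar.
    Returns {relation_index: pair_index}."""
    assignment = {}
    for r_idx, r_vec in enumerate(relation_embeds):
        best = max(range(len(pair_embeds)),
                   key=lambda p: cosine(r_vec, pair_embeds[p]))
        assignment[r_idx] = best
    return assignment
```

The mapped relation embeddings are then decoded to predicate words, which is how the model emits relations outside the dataset's label set.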
https://arxiv.org/abs/2504.12100
The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate them to enhance feature representation and regularize the final generation via a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we integrate it to guide the learning of more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise generating visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.
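The channel attention mentioned above can be illustrated with a generic squeeze-and-excitation-style sketch: pool each channel, gate it, and rescale. The single-weight gate below is an assumption for illustration, not the paper's exact module:

```python
import math

def channel_attention(features, gate_weights):
    """Generic channel attention sketch: `features` is a list of channels,
    each a list of values over time; `gate_weights` has one (assumed
    scalar) gate weight per channel. Each channel is globally average
    pooled, passed through a sigmoid gate, and rescaled by the gate."""
    gates = []
    for channel, w in zip(features, gate_weights):
        pooled = sum(channel) / len(channel)               # squeeze
        gates.append(1.0 / (1.0 + math.exp(-w * pooled)))  # excite
    return [[x * g for x in ch] for ch, g in zip(features, gates)]
```

In TAVCE the gate would be driven by the audio-derived correlation signal, so informative channels are amplified and the rest suppressed.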
https://arxiv.org/abs/2504.05746
Video-Question-Answering (VideoQA) requires capturing complex visual relation changes over time, and it remains a challenge even for advanced Video Language Models (VLMs), in part because the visual content must be represented as a reasonably sized input for those models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding it into structured, decomposed representations. Specifically, inspired by spatiotemporal scene graphs, we propose to encode video sequences as sets of relation triplets of the form (\textit{subject-predicate-object}) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation (MM-NCE) objective together with a Q-Former architecture to align an unordered set of video-derived queries with corresponding text-based relation descriptions. At inference, the resulting Q-Former produces an efficient token representation that can serve as input to a VLM for VideoQA. We evaluate the proposed framework on five challenging benchmarks: NeXT-QA, Intent-QA, STAR, VLEP, and TVQA. The results show that the query-based video representation outperforms global alignment-based CLS or patch token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released.
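An MM-NCE-style objective contrasts each video-derived query against a set of relation descriptions while allowing multiple positives per query. A minimal one-directional sketch (the paper's loss is presumably symmetric, batched, and learned end-to-end with the Q-Former):

```python
import math

def mm_nce_loss(sim_matrix, positives, temperature=0.1):
    """Sketch of a many-to-many noise-contrastive objective.
    `sim_matrix[q][t]` scores query q against relation text t;
    `positives[q]` lists the text indices that describe query q.
    Each query pulls its positive texts above all candidate texts."""
    loss, terms = 0.0, 0
    for q, row in enumerate(sim_matrix):
        denom = sum(math.exp(s / temperature) for s in row)
        for t in positives[q]:
            loss += -math.log(math.exp(row[t] / temperature) / denom)
            terms += 1
    return loss / terms
```

Because queries are unordered, the positives per query would come from a matching step rather than fixed indices; the loss shape itself stays the same.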
https://arxiv.org/abs/2504.05463
Flexible object recognition remains a significant challenge due to the objects' inherently diverse shapes and sizes, translucent attributes, and subtle inter-class differences. Graph-based models, such as graph convolution networks and graph vision models, are promising for flexible object recognition due to their ability to capture variable relations within the flexible objects. These methods, however, often focus on global visual relationships or fail to align semantic and visual information. To alleviate these limitations, we propose a semantic-enhanced heterogeneous graph learning method. First, an adaptive scanning module is employed to extract discriminative semantic context, facilitating the matching of flexible objects with varying shapes and sizes while aligning semantic and visual nodes to enhance cross-modal feature correlation. Second, a heterogeneous graph generation module aggregates global visual and local semantic node features, improving the recognition of flexible objects. Additionally, we introduce FSCW, a large-scale flexible-object dataset curated from existing sources. We validate our method through extensive experiments on flexible-object datasets (FDA and FSCW) and challenging benchmarks (CIFAR-100 and ImageNet-Hard), demonstrating competitive performance.
https://arxiv.org/abs/2503.22079