Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
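As a rough illustration of the two alignment objectives the abstract describes, the sketch below pairs a soft-to-hard prompt alignment term with a prototype-guided visual alignment term. The tensor names, the use of cosine distance, and the loss weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of the two alignment terms described in the abstract.
# Names (soft_text_feats, hard_text_feats, prototypes) and the choice of
# cosine distance are assumptions, not the authors' exact formulation.
import torch
import torch.nn.functional as F

def prompt_alignment_loss(soft_text_feats, hard_text_feats):
    """Pull learnable soft-prompt features toward LLM-generated hard-prompt features."""
    soft = F.normalize(soft_text_feats, dim=-1)   # [num_classes, d]
    hard = F.normalize(hard_text_feats, dim=-1)   # [num_classes, d]
    return (1.0 - (soft * hard).sum(dim=-1)).mean()  # mean cosine distance per class

def prototype_alignment_loss(visual_feats, labels, prototypes):
    """Keep prompted visual features close to their class-specific prototypes."""
    v = F.normalize(visual_feats, dim=-1)           # [batch, d]
    p = F.normalize(prototypes[labels], dim=-1)     # [batch, d]
    return (1.0 - (v * p).sum(dim=-1)).mean()

# total = cls_loss + lambda1 * prompt_alignment_loss(...) + lambda2 * prototype_alignment_loss(...)
```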
https://arxiv.org/abs/2506.21017
Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered a fundamental component of human cognition. Yet, despite major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generation remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships, capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset built by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework in which GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models of up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
https://arxiv.org/abs/2506.07643
Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet, most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible, but unannotated, relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data and enables generalization to unseen predicates. Additionally, we introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 on predicate classification on these three sets. These results highlight the promise of grounded LLM priors for scalable open-world visual understanding.
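The EM-style alternation can be pictured as the following schematic loop, where the LLM proposes candidate relations (expectation) and the visual model is retrained on the hypotheses it can ground (maximization). The function names, the 0.5 grounding threshold, and the fixed number of rounds are hypothetical placeholders rather than the authors' implementation.

```python
# Schematic sketch of the EM-style alternation described above; llm_propose_relations,
# visual_model.score, and train_visual_model are placeholders, not the authors' API.
def iterative_grounding(images, object_detector, llm_propose_relations,
                        visual_model, train_visual_model, num_rounds=3):
    for _ in range(num_rounds):
        # E-step: hypothesize candidate scene graphs from detected objects using the LLM prior.
        pseudo_labels = []
        for img in images:
            objects = object_detector(img)                   # e.g. ["person", "bike", ...]
            candidates = llm_propose_relations(objects)      # [(subject, predicate, object), ...]
            # keep only hypotheses the current visual model can ground in the image
            grounded = [t for t in candidates if visual_model.score(img, t) > 0.5]
            pseudo_labels.append((img, grounded))
        # M-step: refit the visual model so its predictions match the grounded hypotheses.
        visual_model = train_visual_model(visual_model, pseudo_labels)
    return visual_model
```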
https://arxiv.org/abs/2506.05651
Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract and transfer content-aware editing intent to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT) based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.
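A minimal sketch of what a lightweight adapter injecting source-to-target editing intent into a DiT block might look like is given below; the dimensions, the residual cross-attention design, and how the relation tokens are produced are assumptions based only on the abstract.

```python
# A minimal, hypothetical sketch of a lightweight adapter that injects features from a
# source/target example into a DiT block via cross-attention; not the released code.
import torch
import torch.nn as nn

class RelationAdapterSketch(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states, relation_tokens):
        """hidden_states: DiT tokens of the query image [B, N, D];
        relation_tokens: tokens encoding the source->target edit [B, M, D]."""
        ctx, _ = self.cross_attn(hidden_states, relation_tokens, relation_tokens)
        return hidden_states + self.proj(ctx)   # residual injection keeps the base model intact
```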
https://arxiv.org/abs/2506.02528
We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
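The visual-concept retention mechanism is paired with knowledge distillation; a generic distillation term of the kind that could serve this role is sketched below, with the temperature-scaled KL formulation and the choice of features being assumptions rather than the paper's exact loss.

```python
# Minimal sketch of a distillation term for visual-concept retention: during fine-tuning,
# student features are pulled toward those of the frozen pre-trained model. The KL choice
# and feature granularity are assumptions based on the abstract.
import torch.nn.functional as F

def retention_distillation_loss(student_feats, teacher_feats, temperature=2.0):
    s = F.log_softmax(student_feats / temperature, dim=-1)
    t = F.softmax(teacher_feats / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```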
https://arxiv.org/abs/2505.20106
Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternately enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance.
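The alternating enhancement idea can be sketched as two cross-attention blocks that repeatedly update object queries from relation queries and vice versa; the module layout below is an illustrative guess, not the released METOR architecture.

```python
# Schematic sketch of alternating object/relation enhancement; the use of plain
# cross-attention and the number of steps are assumptions for illustration only.
import torch.nn as nn

class IterativeEnhancementSketch(nn.Module):
    def __init__(self, dim=256, heads=8, steps=3):
        super().__init__()
        self.steps = steps
        self.obj_from_rel = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rel_from_obj = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj_queries, rel_queries):
        for _ in range(self.steps):
            # objects attend to relations, then relations attend to the updated objects
            obj_queries = obj_queries + self.obj_from_rel(obj_queries, rel_queries, rel_queries)[0]
            rel_queries = rel_queries + self.rel_from_obj(rel_queries, obj_queries, obj_queries)[0]
        return obj_queries, rel_queries
```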
https://arxiv.org/abs/2505.06663
Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but they remain limited in visual relation understanding (e.g., scene graph generation), particularly in modeling N-ary relationships that identify multiple semantic roles within an action event. This lack of semantic-dependency modeling among multiple entities leads to unreliable outputs, intensifying MLLMs' hallucinations and over-reliance on language priors. To this end, we propose Relation-R1, the first unified relational comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-reward optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Extensive experiments on the widely used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding.
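For readers unfamiliar with GRPO, the core of the update is a group-relative advantage: each sampled response is scored against the mean and standard deviation of its own group. The snippet below shows only that normalization step; the reward values are made up, and the clipped policy ratio and KL regularization used in full GRPO are omitted.

```python
# Minimal sketch of the group-relative advantage used in GRPO-style updates; rewards
# and group size are illustrative, and the policy-ratio clipping and KL terms are omitted.
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: [num_prompts, group_size] rewards for sampled responses per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # each response is scored relative to its own group

rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0]])   # e.g. format + relation-accuracy rewards
print(group_relative_advantages(rewards))
```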
https://arxiv.org/abs/2504.14642
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., "ride" can be depicted as "race" and "sit on", from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
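The post-generation matching stage can be pictured as a nearest-neighbor assignment in embedding space, as in the hedged sketch below; scoring by cosine similarity and the top-k return format are assumptions based on the abstract.

```python
# Hypothetical sketch of the matching stage: each generated relation embedding is assigned
# to the predicate word whose text embedding is most similar.
import torch
import torch.nn.functional as F

def match_relations(generated_embs, predicate_embs, predicate_words, top_k=1):
    """generated_embs: [num_pairs, d]; predicate_embs: [vocab, d]."""
    sims = F.normalize(generated_embs, dim=-1) @ F.normalize(predicate_embs, dim=-1).T
    scores, idx = sims.topk(top_k, dim=-1)
    return [[(predicate_words[j], s) for j, s in zip(row_i, row_s)]
            for row_i, row_s in zip(idx.tolist(), scores.tolist())]
```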
https://arxiv.org/abs/2504.12100
The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation by a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we first integrate it to guide learning more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise generating visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.
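Two ingredients from the abstract, aligning adjacent-clip audio relations with adjacent-frame visual relations and using audio to gate visual channels, are sketched below. The cosine/MSE formulation and the sigmoid channel gate are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketches: (1) align frame-to-frame visual similarity with clip-to-clip audio
# similarity, (2) use an audio-derived gate as channel attention over visual features.
import torch
import torch.nn.functional as F

def temporal_correlation_loss(audio_embs, visual_embs):
    """audio_embs, visual_embs: [T, d] per-clip / per-frame embeddings."""
    a_rel = F.cosine_similarity(audio_embs[1:], audio_embs[:-1], dim=-1)   # [T-1]
    v_rel = F.cosine_similarity(visual_embs[1:], visual_embs[:-1], dim=-1)
    return F.mse_loss(v_rel, a_rel)

def audio_guided_channel_attention(visual_feat, audio_feat, proj):
    """visual_feat: [B, C, H, W]; audio_feat: [B, d]; proj: e.g. nn.Linear(d, C)."""
    gate = torch.sigmoid(proj(audio_feat))                   # [B, C] channel weights from audio
    return visual_feat * gate.unsqueeze(-1).unsqueeze(-1)    # reweighted visual channels
```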
https://arxiv.org/abs/2504.05746
Video Question Answering (VideoQA) requires capturing complex visual relation changes over time and remains a challenge even for advanced Video Language Models (VLMs), in part because of the need to represent the visual content as a reasonably sized input for those models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding it into structured, decomposed representations. Specifically, inspired by spatiotemporal scene graphs, we propose to encode video sequences as sets of relation triplets of the form (subject-predicate-object) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation (MM-NCE) objective together with a Q-Former architecture to align an unordered set of video-derived queries with corresponding text-based relation descriptions. At inference, the resulting Q-Former produces an efficient token representation that can serve as input to a VLM for VideoQA. We evaluate the proposed framework on five challenging benchmarks: NeXT-QA, Intent-QA, STAR, VLEP, and TVQA. The results show that the query-based video representation outperforms global alignment-based CLS or patch token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released.
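As a rough picture of the set-to-set alignment objective, the snippet below contrasts each video's query set against relation-description sets from other videos in the batch, aggregating query-relation similarities with a max-then-mean rule. This is a simplified stand-in; the actual MM-NCE formulation in the paper may differ.

```python
# A much-simplified, hypothetical stand-in for the set-to-set contrastive objective.
import torch
import torch.nn.functional as F

def set_contrastive_loss(query_sets, text_sets, temperature=0.07):
    """query_sets, text_sets: [B, num_items, d] per-video query tokens and relation embeddings."""
    q = F.normalize(query_sets, dim=-1)
    t = F.normalize(text_sets, dim=-1)
    # pairwise video-to-video scores: best-matching query for each relation, then averaged
    sim = torch.einsum("bqd,ckd->bcqk", q, t)          # [B, B, num_queries, num_relations]
    video_scores = sim.max(dim=2).values.mean(dim=2)   # [B, B]
    labels = torch.arange(q.size(0))
    return F.cross_entropy(video_scores / temperature, labels)
```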
https://arxiv.org/abs/2504.05463
Flexible object recognition remains a significant challenge due to the objects' inherently diverse shapes and sizes, translucent attributes, and subtle inter-class differences. Graph-based models, such as graph convolution networks and graph vision models, are promising for flexible object recognition because of their ability to capture variable relations within flexible objects. These methods, however, often focus on global visual relationships or fail to align semantic and visual information. To alleviate these limitations, we propose a semantic-enhanced heterogeneous graph learning method. First, an adaptive scanning module is employed to extract discriminative semantic context, facilitating the matching of flexible objects with varying shapes and sizes while aligning semantic and visual nodes to enhance cross-modal feature correlation. Second, a heterogeneous graph generation module aggregates global visual and local semantic node features, improving the recognition of flexible objects. Additionally, we introduce FSCW, a large-scale flexible-object dataset curated from existing sources. We validate our method through extensive experiments on flexible-object datasets (FDA and FSCW) and challenging benchmarks (CIFAR-100 and ImageNet-Hard), demonstrating competitive performance.
https://arxiv.org/abs/2503.22079
We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, building upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.
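The four-panel composite can be assembled as in the sketch below, with the source/target example on the top row and the query on the bottom-left; the panel ordering and the use of PIL are assumptions for illustration.

```python
# Illustrative sketch of the four-panel composite: the source/target example occupies the
# top row, the query sits bottom-left, and the bottom-right panel is left blank for generation.
from PIL import Image

def make_four_panel(source, target, query, size=512):
    canvas = Image.new("RGB", (2 * size, 2 * size), "white")
    for img, (x, y) in zip(
        [source, target, query],
        [(0, 0), (size, 0), (0, size)],   # (top-left, top-right, bottom-left)
    ):
        canvas.paste(img.resize((size, size)), (x, y))
    return canvas
```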
https://arxiv.org/abs/2503.13327
The video visual relation detection (VidVRD) task is to identify objects and their relationships in videos, which is challenging due to the dynamic content, high annotation costs, and long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet they often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs' rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from automatically generated region captions based on the video's regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy to align semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.
https://arxiv.org/abs/2503.09416
Humans can flexibly switch between different modes of thinking based on task complexity: from rapid intuitive judgments to in-depth analytical understanding. However, current Graphical User Interface (GUI) grounding systems, which locate interface elements based on natural language instructions, rely solely on immediate prediction without reasoning. They struggle to understand complex interface layouts with nested structures and hierarchical relationships, limiting their effectiveness on complex interfaces. Inspired by human dual-system cognition, we present Focus, a novel GUI grounding framework that combines fast prediction with systematic analysis. The framework dynamically switches between rapid and deliberate processing through adaptive system switching based on task complexity, optimizing both efficiency and accuracy. Focus decomposes grounding into progressive stages: interface summarization, visual focused analysis, and precise coordinate prediction. This structured decomposition enables systematic understanding of both interface layouts and visual relationships. Extensive experiments show that Focus achieves state-of-the-art performance with a 2B-parameter model trained on only 300K examples, outperforming existing approaches. Focus demonstrates superior performance particularly in complex GUI scenarios, achieving 77.4% average accuracy on ScreenSpot and 13.3% on the more challenging ScreenSpot-Pro. Our analysis reveals the effectiveness of this dual-system approach while demonstrating its potential for improving complex GUI interaction scenarios.
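The adaptive switching between fast and deliberate processing can be pictured as a simple dispatch on an estimated complexity score, as sketched below; every function name and the threshold are hypothetical placeholders.

```python
# Schematic sketch of the dual-mode dispatch described above; the complexity heuristic and
# stage functions are hypothetical placeholders, not the Focus implementation.
def ground_element(screenshot, instruction, estimate_complexity,
                   fast_predict, summarize_interface, focused_analysis, predict_coords,
                   threshold=0.5):
    if estimate_complexity(screenshot, instruction) < threshold:
        # fast path: direct coordinate prediction
        return fast_predict(screenshot, instruction)
    # deliberate path: progressive stages
    summary = summarize_interface(screenshot)
    region = focused_analysis(screenshot, summary, instruction)
    return predict_coords(screenshot, region, instruction)
```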
https://arxiv.org/abs/2503.06470
Humans and other animals readily generalize abstract relations, such as recognizing constancy in shape or color, whereas neural networks struggle. To investigate how neural networks generalize abstract relations, we introduce SimplifiedRPM, a novel benchmark for systematic evaluation. In parallel, we conduct human experiments to benchmark relational difficulty, enabling direct model-human comparisons. Testing four architectures--ResNet-50, Vision Transformer, Wild Relation Network, and Scattering Compositional Learner (SCL)--we find that SCL best aligns with human behavior and generalizes best. Building on a geometric theory of neural representations, we identify representational geometries that predict generalization. Layer-wise analysis reveals distinct relational reasoning strategies across models and suggests a trade-off in which representations of unseen rules are compressed into subspaces shaped by training. Guided by our geometric perspective, we propose and evaluate SNRloss, a novel objective that balances representation geometry. Our findings offer geometric insights into how neural networks generalize abstract relations, paving the way for more human-like visual reasoning in AI.
https://arxiv.org/abs/2502.17382
The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects in video frames. Researchers working on AVS suffer from limited datasets because hand-crafted annotation is expensive. Recent works attempt to overcome the challenge of limited data by leveraging the segmentation foundation model SAM, prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden of understanding the visual modality by utilizing SAM's pre-trained knowledge, it does not address the fundamental challenge of limited data for learning audio-visual relationships. To address these limitations, we propose AV2T-SAM, a novel framework that bridges audio features with the text embedding space of pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel feature, $f_{CLIP} \odot f_{CLAP}$, which emphasizes the shared semantics of the audio and visual modalities while filtering out irrelevant noise. Experiments on AVSBench demonstrate state-of-the-art performance on both of its subsets. Our approach outperforms existing methods by effectively utilizing pretrained segmentation models and cross-modal semantic alignment.
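The f_CLIP ⊙ f_CLAP feature is an element-wise product that keeps only semantics active in both modalities; a minimal sketch is shown below, where the projection layers mapping CLIP and CLAP embeddings into a shared dimension are assumptions.

```python
# Minimal sketch of the f_CLIP ⊙ f_CLAP feature: an element-wise product of (projected)
# image and audio embeddings. The projection layers and dimensions are assumptions; the
# paper's exact pipeline may differ.
import torch
import torch.nn.functional as F

def shared_semantic_feature(clip_image_emb, clap_audio_emb, proj_img, proj_aud):
    v = F.normalize(proj_img(clip_image_emb), dim=-1)   # [B, d]
    a = F.normalize(proj_aud(clap_audio_emb), dim=-1)   # [B, d]
    return v * a   # element-wise product suppresses modality-specific noise
```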
https://arxiv.org/abs/2502.16359
Visual reasoning refers to the task of answering questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and are hindered by underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER uses this fine-grained visual knowledge to paraphrase underspecified questions. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of "evidence for reasoning" to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology enables VIKSER to learn from its mistakes and improve. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results on relevant tasks.
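A Chain-of-Evidence style prompt might look like the illustrative template below, which asks the model to cite extracted relation facts as evidence before answering; the wording is invented for illustration and is not the paper's actual prompt.

```python
# An illustrative (not the paper's actual) Chain-of-Evidence style prompt: the model is
# asked to cite extracted visual-relation facts as evidence before answering.
def chain_of_evidence_prompt(question, relation_facts):
    evidence = "\n".join(f"- {s} {p} {o}" for s, p, o in relation_facts)
    return (
        "Visual relation facts extracted from the image:\n"
        f"{evidence}\n\n"
        f"Question: {question}\n"
        "First list the facts you use as evidence, then reason step by step, "
        "then give the final answer."
    )
```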
https://arxiv.org/abs/2502.00711
Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information. Our approach utilizes entity clustering to address the complexity of relation triplet categories, enabling the effective integration of subject-object information. Additionally, we utilize a large language model (LLM) to generate detailed region-aware prompts, capturing fine-grained visual interactions and improving alignment between visual and textual modalities. RAHP also introduces a dynamic selection mechanism within Vision-Language Models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets show that our framework consistently achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of open-vocabulary scene graph generation.
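The dynamic selection mechanism can be pictured as ranking region-aware text prompts by similarity to the visual content and keeping only the top few, as in the sketch below; cosine scoring and the top-k rule are assumptions based on the abstract.

```python
# Hypothetical sketch of dynamic prompt selection: region-aware text prompts are ranked by
# similarity to the visual feature and only the top-k are kept.
import torch
import torch.nn.functional as F

def select_prompts(visual_feat, prompt_embs, prompts, k=5):
    """visual_feat: [d]; prompt_embs: [num_prompts, d]; prompts: list of strings."""
    sims = F.normalize(prompt_embs, dim=-1) @ F.normalize(visual_feat, dim=-1)
    top = sims.topk(k).indices.tolist()
    return [prompts[i] for i in top]
```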
https://arxiv.org/abs/2412.19021
In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity Collaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former branch facilitates the aggregation of consistent event semantics across modalities through the encoding of audio-visual relations, while the latter branch guides one modality's focus to pivotal event-relevant temporal areas as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support among coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization. The code is available at this https URL.
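The temporal consistency-gated branch can be sketched as one modality producing a per-timestep gate over the other modality's features; the sigmoid gate and shapes below are illustrative assumptions, not the CCNet implementation.

```python
# Schematic sketch of a temporal consistency gate: one modality's per-timestep relevance
# is used to gate the other modality's temporal features.
import torch
import torch.nn as nn

class ConsistencyGateSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, target_feats, guide_feats):
        """target_feats, guide_feats: [B, T, D]; gate the target by the guide modality's saliency."""
        gate = torch.sigmoid(self.score(guide_feats))   # [B, T, 1]
        return target_feats * gate
```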
https://arxiv.org/abs/2412.12628
Spatio-Temporal Scene Graphs (STSGs) provide a concise and expressive representation of dynamic scenes by modelling objects and their evolving relationships over time. However, real-world visual relationships often exhibit a long-tailed distribution, causing existing methods for tasks like Video Scene Graph Generation (VidSGG) and Scene Graph Anticipation (SGA) to produce biased scene graphs. To this end, we propose ImparTail, a novel training framework that leverages curriculum learning and loss masking to mitigate bias in the generation and anticipation of spatio-temporal scene graphs. Our approach gradually decreases the dominance of the head relationship classes during training and focuses more on tail classes, leading to more balanced training. Furthermore, we introduce two new tasks, Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation, designed to evaluate the robustness of STSG models against distribution shifts. Extensive experiments on the Action Genome dataset demonstrate that our framework significantly enhances the unbiased performance and robustness of STSG models compared to existing methods.
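The curriculum-with-loss-masking idea can be sketched as progressively dropping head-class samples from the loss as training proceeds; the linear schedule and random masking rule below are assumptions for illustration, not the paper's exact strategy.

```python
# Illustrative sketch of curriculum-style loss masking over head classes: as training
# progresses, a growing fraction of head-class samples is dropped from the loss so tail
# classes dominate the gradient. The schedule and masking rule are assumptions.
import torch
import torch.nn.functional as F

def masked_relation_loss(logits, targets, head_classes, epoch, total_epochs):
    """logits: [B, num_rel]; targets: [B]; head_classes: 1-D tensor of head-class indices."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    drop_prob = min(1.0, epoch / total_epochs)              # curriculum: mask more over time
    is_head = torch.isin(targets, head_classes)
    keep = ~(is_head & (torch.rand_like(per_sample) < drop_prob))
    return per_sample[keep].mean() if keep.any() else per_sample.mean() * 0.0
```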
https://arxiv.org/abs/2411.13059