We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based image editing (TIE) methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based image editing (RIE), on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, built upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.
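A minimal sketch of the four-panel arrangement described above, assuming the composite places the source-target example on the top row and the query plus a blank panel to be generated on the bottom row (the exact panel layout is our assumption, not stated in the abstract):

```python
import torch

def make_four_panel(source: torch.Tensor, target: torch.Tensor,
                    query: torch.Tensor) -> torch.Tensor:
    """Tile the editing example and the query into one 2x2 composite.

    Each input is a (3, H, W) image tensor in [0, 1]. The bottom-right
    panel is left blank; during fine-tuning/inference the model is asked
    to fill it with the query image after the demonstrated edit.
    """
    _, h, w = source.shape
    blank = torch.zeros(3, h, w)                    # panel to be generated
    top = torch.cat([source, target], dim=2)        # source -> edited target
    bottom = torch.cat([query, blank], dim=2)       # query  -> ???
    return torch.cat([top, bottom], dim=1)          # (3, 2H, 2W)

# Example with three random 512x512 "images"
composite = make_four_panel(*(torch.rand(3, 512, 512) for _ in range(3)))
print(composite.shape)  # torch.Size([3, 1024, 1024])
```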
https://arxiv.org/abs/2503.13327
The video visual relation detection (VidVRD) task is to identify objects and their relationships in videos, which is challenging due to the dynamic content, high annotation costs, and long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs' rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from automatically generated region captions based on the video's regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy to align semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.
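The open-vocabulary classification step can be pictured with a small sketch: a subject-object pair feature is scored against text embeddings of candidate relation prompts. The `fake_embed_text` stand-in replaces the VLM text encoder, and all names and dimensions are illustrative rather than the paper's configuration:

```python
import torch
import torch.nn.functional as F

def open_vocab_relation_scores(pair_feat: torch.Tensor,
                               relation_prompts: list,
                               embed_text) -> torch.Tensor:
    """Score one subject-object pair feature against text embeddings of
    candidate relation prompts (e.g. "a person riding a bicycle")."""
    text_emb = torch.stack([embed_text(p) for p in relation_prompts])  # (R, D)
    pair_feat = F.normalize(pair_feat, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return pair_feat @ text_emb.T          # cosine similarities, shape (R,)

# Stand-in text encoder; a real system would call a VLM such as CLIP.
def fake_embed_text(prompt: str, dim: int = 512) -> torch.Tensor:
    g = torch.Generator().manual_seed(hash(prompt) % (2**31))
    return torch.randn(dim, generator=g)

scores = open_vocab_relation_scores(
    torch.randn(512),
    ["person riding bicycle", "person next to bicycle", "dog chasing ball"],
    fake_embed_text)
print(scores.softmax(-1))
```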
https://arxiv.org/abs/2503.09416
Humans can flexibly switch between different modes of thinking based on task complexity: from rapid intuitive judgments to in-depth analytical understanding. However, current Graphical User Interface (GUI) grounding systems, which locate interface elements based on natural language instructions, rely solely on immediate prediction without reasoning and struggle to understand complex interface layouts with nested structures and hierarchical relationships, limiting their effectiveness on complex interfaces. Inspired by human dual-system cognition, we present Focus, a novel GUI grounding framework that combines fast prediction with systematic analysis. The framework dynamically switches between rapid and deliberate processing through an adaptive system-switching mechanism based on task complexity, optimizing both efficiency and accuracy. Focus decomposes grounding into progressive stages: interface summarization, visual focused analysis, and precise coordinate prediction. This structured decomposition enables systematic understanding of both interface layouts and visual relationships. Extensive experiments show that Focus achieves state-of-the-art performance using only 300K training samples and a 2B-parameter model, compared to existing approaches. Focus demonstrates superior performance particularly in complex GUI scenarios, achieving 77.4% average accuracy on ScreenSpot and 13.3% on the more challenging ScreenSpot-Pro. Our analysis reveals the effectiveness of this dual-system approach while demonstrating its potential for improving complex GUI interaction scenarios.
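A rough sketch of the adaptive switching, assuming a scalar complexity estimate routes each request either to a single-shot predictor or through the three staged functions; every callable here is a hypothetical placeholder for the models the abstract describes, not Focus's actual interface:

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class FocusRouter:
    """Route a grounding request to a fast or a deliberate pipeline."""
    complexity_fn: Callable[[str, object], float]
    fast_fn: Callable[[str, object], Tuple[int, int]]
    summarize_fn: Callable[[object], str]
    focus_fn: Callable[[str, str, object], object]
    predict_fn: Callable[[str, object], Tuple[int, int]]
    threshold: float = 0.5

    def ground(self, instruction: str, screenshot) -> Tuple[int, int]:
        if self.complexity_fn(instruction, screenshot) < self.threshold:
            return self.fast_fn(instruction, screenshot)          # System 1
        summary = self.summarize_fn(screenshot)                    # System 2,
        region = self.focus_fn(instruction, summary, screenshot)   # staged
        return self.predict_fn(instruction, region)                # analysis

# Toy usage with lambda stand-ins for the actual models.
router = FocusRouter(
    complexity_fn=lambda instr, img: 0.9 if "settings" in instr else 0.2,
    fast_fn=lambda instr, img: (10, 20),
    summarize_fn=lambda img: "toolbar, sidebar, canvas",
    focus_fn=lambda instr, summary, img: img,
    predict_fn=lambda instr, region: (480, 120),
)
print(router.ground("open the advanced settings panel", object()))
```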
https://arxiv.org/abs/2503.06470
Humans and other animals readily generalize abstract relations, such as recognizing constancy in shape or color, whereas neural networks struggle. To investigate how neural networks generalize abstract relations, we introduce SimplifiedRPM, a novel benchmark for systematic evaluation. In parallel, we conduct human experiments to benchmark relational difficulty, enabling direct model-human comparisons. Testing four architectures--ResNet-50, Vision Transformer, Wild Relation Network, and Scattering Compositional Learner (SCL)--we find that SCL best aligns with human behavior and generalizes best. Building on a geometric theory of neural representations, we show that representational geometries predict generalization. Layer-wise analysis reveals distinct relational reasoning strategies across models and suggests a trade-off in which unseen rule representations are compressed into training-shaped subspaces. Guided by our geometric perspective, we propose and evaluate SNRloss, a novel objective that balances representation geometry. Our findings offer geometric insights into how neural networks generalize abstract relations, paving the way for more human-like visual reasoning in AI.
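The abstract does not spell out SNRloss, but the kind of geometric quantity it builds on can be illustrated with a simplified signal-to-noise ratio between two rule manifolds: centroid separation relative to within-class spread. This is a toy illustration under that assumption, not the paper's objective:

```python
import torch

def pairwise_snr(reps_a: torch.Tensor, reps_b: torch.Tensor) -> torch.Tensor:
    """Simplified signal-to-noise ratio between two sets of representations.

    reps_a, reps_b: (N, D) representations of two relational rules.
    Signal = squared distance between the two class centroids.
    Noise  = summed within-class variance of both classes.
    """
    mu_a, mu_b = reps_a.mean(0), reps_b.mean(0)
    signal = (mu_a - mu_b).pow(2).sum()
    noise = reps_a.var(0).sum() + reps_b.var(0).sum()
    return signal / (noise + 1e-8)

a = torch.randn(100, 64) + 2.0   # toy representations of rule A
b = torch.randn(100, 64)         # toy representations of rule B
print(pairwise_snr(a, b))
```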
https://arxiv.org/abs/2502.17382
The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects from the video frames. Researchers working on AVS suffer from limited datasets because hand-crafted annotation is expensive. Recent works attempt to overcome the challenge of limited data by leveraging the segmentation foundation model, SAM, prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden on understanding visual modality by utilizing pre-trained knowledge of SAM, it does not address the fundamental challenge of the limited dataset for learning audio-visual relationships. To address these limitations, we propose \textbf{AV2T-SAM}, a novel framework that bridges audio features with the text embedding space of pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel feature, $\mathbf{\textit{\textbf{f}}_{CLIP} \odot \textit{\textbf{f}}_{CLAP}}$, which emphasizes shared semantics of audio and visual modalities while filtering irrelevant noise. Experiments on the AVSBench benchmark demonstrate state-of-the-art performance on both of its datasets. Our approach outperforms existing methods by effectively utilizing pretrained segmentation models and cross-modal semantic alignment.
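A hedged sketch of the $f_{CLIP} \odot f_{CLAP}$ idea: normalized CLIP image and CLAP audio embeddings are multiplied element-wise to keep shared semantics, then projected toward the prompt space of a text-promptable segmenter. The projection layers and all dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualPrompt(nn.Module):
    """Fuse CLIP image and CLAP audio embeddings into a prompt embedding."""

    def __init__(self, clip_dim=768, clap_dim=512, prompt_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(clap_dim, clip_dim)   # align the two spaces
        self.to_prompt = nn.Linear(clip_dim, prompt_dim)  # toward the prompt space

    def forward(self, f_clip: torch.Tensor, f_clap: torch.Tensor) -> torch.Tensor:
        f_clip = F.normalize(f_clip, dim=-1)
        f_clap = F.normalize(self.audio_proj(f_clap), dim=-1)
        fused = f_clip * f_clap          # element-wise product: shared semantics
        return self.to_prompt(fused)     # prompt embedding for the mask decoder

prompt = AudioVisualPrompt()(torch.randn(1, 768), torch.randn(1, 512))
print(prompt.shape)  # torch.Size([1, 256])
```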
https://arxiv.org/abs/2502.16359
Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and hindered by the phenomenon of underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes fine-grained visual knowledge to paraphrase questions that suffer from underspecification. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of ``evidence for reasoning'' to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks.
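One way to picture Chain-of-Evidence prompting is as a prompt that lists detected relation triplets as numbered evidence and asks the model to cite them at each reasoning step; the template below is illustrative, not the paper's actual prompt:

```python
def chain_of_evidence_prompt(question: str, triplets: list) -> str:
    """Assemble a prompt that asks the model to cite visual evidence
    (subject, relation, object) for each reasoning step."""
    evidence = "\n".join(
        f"  E{i}: ({s}, {r}, {o})" for i, (s, r, o) in enumerate(triplets, 1))
    return (
        "Visual evidence extracted from the image:\n"
        f"{evidence}\n\n"
        f"Question: {question}\n"
        "Answer step by step. For every step, cite the evidence IDs "
        "(e.g. [E1, E3]) that support it, then give the final answer.")

print(chain_of_evidence_prompt(
    "What is the man about to do?",
    [("man", "holding", "frisbee"), ("dog", "looking at", "man")]))
```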
https://arxiv.org/abs/2502.00711
Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information. Our approach utilizes entity clustering to address the complexity of relation triplet categories, enabling the effective integration of subject-object information. Additionally, we utilize a large language model (LLM) to generate detailed region-aware prompts, capturing fine-grained visual interactions and improving alignment between visual and textual modalities. RAHP also introduces a dynamic selection mechanism within Vision-Language Models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets show that our framework consistently achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of open-vocabulary scene graph generation.
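The dynamic selection mechanism can be sketched as relevance-gated prompt pooling: score each region-aware prompt embedding against the visual feature, keep the top-k, and average them with softmax weights. The scoring and pooling choices here are our assumptions:

```python
import torch
import torch.nn.functional as F

def select_prompts(visual_feat: torch.Tensor,
                   prompt_embs: torch.Tensor,
                   top_k: int = 4) -> torch.Tensor:
    """Keep only the prompts most relevant to the visual content and return
    their similarity-weighted average as the text classifier for this region.

    visual_feat: (D,) region feature; prompt_embs: (P, D) prompt embeddings.
    """
    sims = F.cosine_similarity(visual_feat.unsqueeze(0), prompt_embs, dim=-1)  # (P,)
    top_sims, idx = sims.topk(min(top_k, sims.numel()))
    weights = top_sims.softmax(dim=0)
    return (weights.unsqueeze(-1) * prompt_embs[idx]).sum(0)  # (D,)

classifier = select_prompts(torch.randn(512), torch.randn(20, 512))
print(classifier.shape)  # torch.Size([512])
```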
https://arxiv.org/abs/2412.19021
In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, {untrimmed} videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel \ul{CC}Net, comprising two core modules: the Cross-Modal Consistency \ul{C}ollaboration (CMCC) and the Multi-Temporal Granularity \ul{C}ollaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former branch facilitates the aggregation of consistent event semantics across modalities through the encoding of audio-visual relations, while the latter branch guides one modality's focus to pivotal event-relevant temporal areas as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support among coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization. The code is available at \url{this https URL}.
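A minimal sketch of the temporal consistency-gated idea, assuming per-frame event-relevance scores estimated from one modality gate the other modality's frame features; layer sizes are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TemporalGate(nn.Module):
    """Gate modality A's frame features with relevance scores from modality B."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # per-frame relevance from modality B

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (T, dim) audio/visual frame features
        gate = torch.sigmoid(self.scorer(feats_b))   # (T, 1)
        return feats_a * gate                        # keep event-relevant frames

out = TemporalGate()(torch.randn(60, 256), torch.randn(60, 256))
print(out.shape)  # torch.Size([60, 256])
```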
https://arxiv.org/abs/2412.12628
Spatio-Temporal Scene Graphs (STSGs) provide a concise and expressive representation of dynamic scenes by modelling objects and their evolving relationships over time. However, real-world visual relationships often exhibit a long-tailed distribution, causing existing methods for tasks like Video Scene Graph Generation (VidSGG) and Scene Graph Anticipation (SGA) to produce biased scene graphs. To this end, we propose ImparTail, a novel training framework that leverages curriculum learning and loss masking to mitigate bias in the generation and anticipation of spatio-temporal scene graphs. Our approach gradually decreases the dominance of the head relationship classes during training and focuses more on tail classes, leading to more balanced training. Furthermore, we introduce two new tasks, Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation, designed to evaluate the robustness of STSG models against distribution shifts. Extensive experiments on the Action Genome dataset demonstrate that our framework significantly enhances the unbiased performance and robustness of STSG models compared to existing methods.
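A possible reading of curriculum-driven loss masking: the probability of masking a training sample grows with its relation class frequency and with training progress, so head classes gradually contribute less. The exact schedule below is our assumption, not ImparTail's:

```python
import torch
import torch.nn.functional as F

def masked_relation_loss(logits: torch.Tensor, targets: torch.Tensor,
                         class_freq: torch.Tensor, progress: float) -> torch.Tensor:
    """Cross-entropy where samples from frequent (head) relation classes are
    progressively masked out as training progresses (progress in [0, 1])."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")   # (N,)
    freq = class_freq[targets] / class_freq.sum()                     # (N,)
    # Mask probability grows with class frequency and curriculum progress.
    keep = (torch.rand_like(freq) > freq * progress).float()
    return (per_sample * keep).sum() / keep.sum().clamp(min=1.0)

loss = masked_relation_loss(torch.randn(8, 5), torch.randint(0, 5, (8,)),
                            torch.tensor([500., 200., 50., 10., 5.]), progress=0.7)
print(loss)
```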
https://arxiv.org/abs/2411.13059
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline -- computing similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic as they remain unchanged across contexts. Using such fixed text classifiers not only struggles to model visual relations with high variance, but also falls short in adapting to distinct contexts. To plug these intrinsic shortcomings, we devise SDSGG, a scene-specific description based OVSGG framework where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM is asked to play different roles (e.g., biologist and engineer) to analyze and discuss the descriptive features of a given scene from different views. Unlike previous efforts simply treating the generated descriptions as mutually equivalent text classifiers, SDSGG is equipped with an advanced renormalization mechanism to adjust the influence of each text classifier based on its relevance to the presented scene (this is what the term "specific" means). Furthermore, to capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter. It refines CLIP's ability to recognize relations by learning an interaction-aware semantic space. Extensive experiments on prevalent benchmarks show that SDSGG outperforms top-leading methods by a clear margin.
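The renormalization mechanism can be sketched as relevance-weighted mixing of per-description classifiers: each generated scene description contributes relation logits in proportion to its similarity to the current image. The temperature and weighting scheme are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def renormalized_logits(image_emb: torch.Tensor,
                        desc_embs: torch.Tensor,
                        desc_logits: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Combine relation logits from each description, weighting every
    description by its relevance to the image instead of treating all
    generated descriptions as equivalent classifiers.

    image_emb:   (D,) image embedding
    desc_embs:   (K, D) embeddings of the K generated descriptions
    desc_logits: (K, R) relation logits contributed by each description
    """
    rel = F.cosine_similarity(image_emb.unsqueeze(0), desc_embs, dim=-1)  # (K,)
    weights = (rel / temperature).softmax(dim=0)                          # (K,)
    return (weights.unsqueeze(-1) * desc_logits).sum(0)                   # (R,)

logits = renormalized_logits(torch.randn(512), torch.randn(6, 512), torch.randn(6, 50))
print(logits.shape)  # torch.Size([50])
```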
https://arxiv.org/abs/2410.15364
Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in actions that we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art performance on several visual reasoning tasks while being more sample-efficient and generalizing better to out-of-distribution visual inputs than prior models.
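A small sketch of glimpse-based perception: take the top-k most salient locations, crop a patch at each, and keep the normalized 2-D location as the low-dimensional signal that accompanies the patch content downstream. The patch size and the saliency source are placeholders, not the paper's design:

```python
import torch

def take_glimpses(image: torch.Tensor, saliency: torch.Tensor,
                  num_glimpses: int = 4, size: int = 32):
    """Return (patch, location) pairs for the most salient image regions."""
    c, h, w = image.shape
    idx = saliency.flatten().topk(num_glimpses).indices
    ys, xs = idx // w, idx % w
    glimpses = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        y0 = max(0, min(h - size, y - size // 2))
        x0 = max(0, min(w - size, x - size // 2))
        patch = image[:, y0:y0 + size, x0:x0 + size]   # high-resolution content
        loc = torch.tensor([y / h, x / w])             # low-dimensional location
        glimpses.append((patch, loc))
    return glimpses

img = torch.rand(3, 128, 128)
out = take_glimpses(img, img.mean(0))   # toy saliency map
print(len(out), out[0][0].shape, out[0][1])
```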
https://arxiv.org/abs/2409.20213
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based Transformer decoder, where the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and accordingly, an auxiliary relationship loss is designed to enable the decoder to perceive the relationships between objects explicitly. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP well to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for visual representation and vision-guided language prompting for language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. Our framework is also applied to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
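The relationship-query idea can be sketched by appending one extra query to the decoder's object queries and attaching an auxiliary relation-classification head to its output; layer counts and the relation vocabulary size below are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class RelationAwareDecoder(nn.Module):
    """Decoder whose query set is [object queries ; one relation query];
    an auxiliary head on the relation query lets the decoder explicitly
    perceive which relations are present in the frame."""

    def __init__(self, dim=256, num_obj_queries=20, num_relations=132, layers=3):
        super().__init__()
        self.obj_queries = nn.Parameter(torch.randn(num_obj_queries, dim))
        self.rel_query = nn.Parameter(torch.randn(1, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.rel_head = nn.Linear(dim, num_relations)   # auxiliary relation logits

    def forward(self, frame_tokens: torch.Tensor):
        # frame_tokens: (B, N, dim) visual tokens of one frame
        b = frame_tokens.size(0)
        queries = torch.cat([self.obj_queries, self.rel_query], dim=0)
        queries = queries.unsqueeze(0).repeat(b, 1, 1)
        out = self.decoder(queries, frame_tokens)
        obj_feats, rel_feat = out[:, :-1], out[:, -1]
        return obj_feats, self.rel_head(rel_feat)   # e.g. BCE on these logits

obj, rel_logits = RelationAwareDecoder()(torch.randn(2, 49, 256))
print(obj.shape, rel_logits.shape)
```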
https://arxiv.org/abs/2409.12499
Interacting with real-world cluttered scenes poses several challenges for robotic agents, which need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically manage simplified scenarios and focus on predicting pairwise object relationships following an initial object detection phase, but often overlook the global context or struggle with handling redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at this https URL.
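A minimal sketch of dependency-graph prediction: detected object embeddings are paired and scored by an MLP to produce an adjacency matrix of spatial dependencies. The head design is an assumption for illustration, not D3G's actual architecture:

```python
import torch
import torch.nn as nn

class DependencyHead(nn.Module):
    """Predict an adjacency matrix over detected objects: entry (i, j)
    scores whether object i blocks or must precede object j."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (N, dim) per-object embeddings from the detector
        n = obj_feats.size(0)
        src = obj_feats.unsqueeze(1).expand(n, n, -1)   # object i
        dst = obj_feats.unsqueeze(0).expand(n, n, -1)   # object j
        logits = self.edge_mlp(torch.cat([src, dst], dim=-1)).squeeze(-1)
        return logits   # (N, N) adjacency logits; sigmoid/threshold downstream

adj = DependencyHead()(torch.randn(35, 256))
print(adj.shape)  # torch.Size([35, 35])
```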
https://arxiv.org/abs/2409.02035
To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96\% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can run inference quickly even when a single engineering drawing contains numerous objects.
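A bipartite sketch of what the relation decoder has to do: score every (table, circuit) embedding pair produced by the object encoder for association. The pairwise MLP is an assumed stand-in for ViRED's actual decoder:

```python
import torch
import torch.nn as nn

class TableCircuitScorer(nn.Module):
    """Score every (table, circuit) pair in a drawing for association."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, tables: torch.Tensor, circuits: torch.Tensor) -> torch.Tensor:
        # tables: (T, dim), circuits: (C, dim) object-encoder outputs
        t, c = tables.size(0), circuits.size(0)
        pairs = torch.cat([tables.unsqueeze(1).expand(t, c, -1),
                           circuits.unsqueeze(0).expand(t, c, -1)], dim=-1)
        return self.scorer(pairs).squeeze(-1).sigmoid()   # (T, C) probabilities

probs = TableCircuitScorer()(torch.randn(5, 256), torch.randn(12, 256))
print(probs.shape)  # torch.Size([5, 12])
```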
https://arxiv.org/abs/2409.00909
Exploiting both audio and visual modalities for video classification is a challenging task, as the existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64\% with only 72M parameters, comparable to larger baseline models such as Fully-Connected Late Fusion (75.96\% F1 score, 341M parameters) while reducing the model size by nearly 80\%, highlighting its efficiency in terms of model complexity. Our work demonstrates that Attend-Fusion effectively combines audio and visual information for video classification, achieving competitive performance with a significantly smaller model. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
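A compact attention-based AV fusion head in the spirit described above: video tokens cross-attend to audio tokens, the result is pooled and classified. The dimensions and class count are illustrative and not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AttendFusionHead(nn.Module):
    """Compact audio-visual fusion via cross-attention plus a classifier."""

    def __init__(self, dim=512, num_classes=3862, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, Tv, dim) frame features; audio: (B, Ta, dim) audio features
        fused, _ = self.cross_attn(query=video, key=audio, value=audio)
        pooled = (video + fused).mean(dim=1)    # residual fusion + mean pooling
        return self.classifier(pooled)          # multi-label logits

logits = AttendFusionHead()(torch.randn(2, 300, 512), torch.randn(2, 100, 512))
print(logits.shape)  # torch.Size([2, 3862])
```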
https://arxiv.org/abs/2408.14441
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performances on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales. The code is available at \textcolor[RGB]{228,58,136}{\href{this https URL}{this https URL}}.
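The one-stage formulation can be pictured as a 1-D mask head: per-frame subject and object features are fused, and a temporal convolution emits a per-predicate binary mask over the timeline. The fusion choice and all sizes below are assumptions, not VrdONE's exact design:

```python
import torch
import torch.nn as nn

class PredicateMaskHead(nn.Module):
    """Predicate detection as 1-D instance segmentation over time."""

    def __init__(self, dim=256, num_predicates=132):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)                 # subject + object
        self.mask_head = nn.Conv1d(dim, num_predicates, kernel_size=3, padding=1)

    def forward(self, subj: torch.Tensor, obj: torch.Tensor) -> torch.Tensor:
        # subj, obj: (B, T, dim) per-frame features of one entity pair
        x = torch.relu(self.fuse(torch.cat([subj, obj], dim=-1)))  # (B, T, dim)
        logits = self.mask_head(x.transpose(1, 2))                 # (B, P, T)
        return logits.sigmoid()   # per-predicate, per-frame binary mask

masks = PredicateMaskHead()(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(masks.shape)  # torch.Size([2, 132, 64])
```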
https://arxiv.org/abs/2408.09408
Visual relationship understanding has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks. Given the complexity and interconnectedness of these tasks, it is crucial to have a flexible framework that can effectively address them in a cohesive manner. In this work, we propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation, and further possesses the capability for open-vocabulary segmentation to adapt to novel scenarios. FleVRS leverages the synergy between text and image modalities to ground various types of relationships from images and uses textual features from vision-language models to support visual conceptual understanding. Empirical validation across various datasets demonstrates that our framework outperforms existing models in standard, promptable, and open-vocabulary tasks, e.g., +1.9 $mAP$ on HICO-DET, +11.4 $Acc$ on VRD, and +4.7 $mAP$ on unseen HICO-DET. Our FleVRS represents a significant step towards a more intuitive, comprehensive, and scalable understanding of visual relationships.
https://arxiv.org/abs/2408.08305
Recent Audio-Visual Question Answering (AVQA) methods rely on complete visual and audio input to answer questions accurately. However, in real-world scenarios, issues such as device malfunctions and data transmission errors frequently result in missing audio or visual modality. In such cases, existing AVQA methods suffer significant performance degradation. In this paper, we propose a framework that ensures robust AVQA performance even when a modality is missing. First, we propose a Relation-aware Missing Modal (RMM) generator with Relation-aware Missing Modal Recalling (RMMR) loss to enhance the ability of the generator to recall missing modal information by understanding the relationships and context among the available modalities. Second, we design an Audio-Visual Relation-aware (AVR) diffusion model with Audio-Visual Enhancing (AVE) loss to further enhance audio-visual features by leveraging the relationships and shared cues between the audio-visual modalities. As a result, our method can provide accurate answers by effectively utilizing available information even when input modalities are missing. We believe our method holds potential applications not only in AVQA research but also in various multi-modal scenarios.
https://arxiv.org/abs/2407.16171
Text-based person search (TBPS) is a problem that has gained significant interest within the research community. The task is that of retrieving one or more images of a specific individual based on a textual description. The multi-modal nature of the task requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. The first is inter-identity noise, which stems from the inherent vagueness and imprecision of text descriptions: descriptions of visual attributes can generally be associated with different people. The second is intra-identity variation: nuisances such as pose and illumination that can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description. In doing so, the model is encouraged to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text. This loss ensures that every attribute is taken into consideration in the person retrieval process. Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric with respect to the current state of the art.
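The Attribute Loss can be illustrated by averaging a per-chunk alignment term so that every adjective-noun chunk counts equally; the cosine-based form below is an assumption for illustration, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def attribute_loss(image_emb: torch.Tensor,
                   chunk_embs: torch.Tensor) -> torch.Tensor:
    """Equal-weight alignment loss over attribute chunks.

    image_emb:  (D,) embedding of the person image
    chunk_embs: (A, D) embeddings of the A adjective-noun chunks
                (e.g. "red jacket", "black backpack") in the caption
    """
    sims = F.cosine_similarity(image_emb.unsqueeze(0), chunk_embs, dim=-1)  # (A,)
    return (1.0 - sims).mean()   # every attribute contributes equally

loss = attribute_loss(torch.randn(512), torch.randn(4, 512))
print(loss)
```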
https://arxiv.org/abs/2407.04287
The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which is essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset's long-tail distribution significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information.
https://arxiv.org/abs/2406.16449