Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) the translation of previous sentences, and (iii) pseudo-glosses transcribing the signing. These cues are automatically extracted and fed, together with the visual features, to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, as well as to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it to How2Sign, an American Sign Language dataset, and achieve competitive results.
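As a rough illustration of how these cues might be assembled for the LLM, the sketch below concatenates the three textual inputs around a placeholder where projected video features would be spliced in. The function name, prompt template, and example values are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: assembling the contextual cues that accompany the
# sign-video features before they reach a fine-tuned LLM. The template and
# names are illustrative, not the paper's exact input format.

def build_contextual_prompt(background_caption: str,
                            previous_translation: str,
                            pseudo_glosses: list[str]) -> str:
    """Concatenate the three textual cues around a placeholder where the
    projected visual features would be inserted into the LLM input."""
    gloss_str = " ".join(pseudo_glosses)
    return (
        f"Background: {background_caption}\n"
        f"Previous sentence: {previous_translation}\n"
        f"Glosses: {gloss_str}\n"
        f"Signing video: <video_features>\n"
        "Translation:"
    )

prompt = build_contextual_prompt(
    background_caption="A documentary about coastal wildlife.",
    previous_translation="The birds return to the cliffs every spring.",
    pseudo_glosses=["BIRD", "NEST", "BUILD"],
)
print(prompt)
```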
https://arxiv.org/abs/2501.09754
Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at this https URL.
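As a minimal sketch of the modality-alignment idea, assuming it can be approximated by a linear map fitted on the few selected image-text pairs (the paper's actual alignment procedure may differ), one could do the following:

```python
import numpy as np

# Sketch of few-shot modality alignment: fit a linear map W that carries image
# embeddings into the text-embedding space using only a few paired examples,
# so that a decoder trained purely on text embeddings can consume aligned
# image features. This least-squares formulation is an assumption for
# illustration only.

rng = np.random.default_rng(0)
d = 512                              # shared embedding dimension
k = 16                               # few-shot paired examples
img_few = rng.normal(size=(k, d))    # image embeddings of the selected subset
txt_few = rng.normal(size=(k, d))    # matching text embeddings

# Solve min_W ||img_few @ W - txt_few||^2 in closed form.
W, *_ = np.linalg.lstsq(img_few, txt_few, rcond=None)

# At inference, unpaired image embeddings are mapped into text space and fed
# to the text-trained decoder.
img_test = rng.normal(size=(100, d))
aligned = img_test @ W               # shape (100, d), lives in the text space
print(aligned.shape)
```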
https://arxiv.org/abs/2501.09555
LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior works focus on designing their own losses to effectively distill the pre-trained 2D image representations into a 3D model. However, the other parts of the design have remained surprisingly unexplored. We find that fundamental design elements, e.g., the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than developing loss functions, yet they have been overlooked in prior works. In this work, we show that simple fixes to these designs notably outperform existing methods, improving downstream task performance by 16% in 3D semantic segmentation on the nuScenes dataset and 13% in 3D object detection on the KITTI dataset. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering the side effects they yield with a commonly deployed sparse convolution layer input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting usage to only the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.
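A quick back-of-the-envelope illustration of the spatial quantization issue: with cylindrical coordinates, one angular bin covers a lateral extent that grows with range, unlike a fixed-size Cartesian voxel. The bin and voxel sizes below are illustrative, not the values used in the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's code): the arc length covered by one
# azimuth bin grows linearly with range, so cylindrical voxelization quantizes
# far-away points much more coarsely than a fixed-size Cartesian grid does.

angular_bin = np.deg2rad(0.5)        # hypothetical 0.5-degree azimuth bin
cartesian_voxel = 0.1                # hypothetical 10 cm Cartesian voxel

for r in [5.0, 20.0, 50.0]:          # ranges in meters
    arc = r * angular_bin            # lateral extent of one cylindrical bin
    print(f"range {r:5.1f} m: one cylindrical bin spans {arc:.3f} m laterally "
          f"vs a {cartesian_voxel:.3f} m Cartesian voxel")
```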
https://arxiv.org/abs/2501.09485
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $\Delta$CLIP and $\Delta^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $\Delta$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $\Delta^2$LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is this https URL.
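For context, adversarial training of this kind typically relies on an inner attack loop such as L-infinity PGD; the generic sketch below shows that ingredient only and does not reproduce the authors' exact attack settings or training losses.

```python
import torch

# Generic L-infinity PGD inner loop, the standard ingredient of adversarial
# (pre-)training; step sizes, budgets, and the training loss used by the
# authors may differ.

def pgd_attack(model, images, targets, loss_fn,
               eps=4 / 255, step=1 / 255, steps=10):
    adv = images.clone().detach()
    adv += torch.empty_like(adv).uniform_(-eps, eps)   # random start
    adv = adv.clamp(0, 1)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv), targets)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + step * grad.sign()             # ascend the loss
            adv = images + (adv - images).clamp(-eps, eps)
            adv = adv.clamp(0, 1)
    return adv.detach()
```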
https://arxiv.org/abs/2501.09446
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
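To make the data-centric idea concrete, here is a toy sketch of how negated captions could be templated from object annotations; the actual synthetic-data pipeline is more elaborate, and the function below is purely illustrative.

```python
# Toy template sketch of turning object annotations into negated captions of
# the kind used as synthetic fine-tuning data. The real generation pipeline is
# more varied; this only illustrates the idea.

def negated_captions(present, absent):
    captions = []
    for obj in absent:
        captions.append(f"A photo of {', '.join(present)} but no {obj}.")
        captions.append(f"A photo that does not contain a {obj}.")
    return captions

print(negated_captions(present=["a dog", "a frisbee"], absent=["cat", "car"]))
```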
https://arxiv.org/abs/2501.09425
Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with an optimal training strategy, our experiments demonstrate that each component of the framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing. Code is available at this https URL.
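As a generic illustration of the optimal-transport machinery involved, the sketch below runs a few Sinkhorn iterations to obtain a soft transport plan between audio and visual token features from a cosine-distance cost; it is not LAVCap's exact loss or attention module.

```python
import torch

# Minimal entropic-OT (Sinkhorn) sketch: compute a transport plan between
# audio and visual token features. An OT-based alignment loss or attention
# can use such a plan as a soft assignment map. Generic illustration only.

def sinkhorn_plan(audio, visual, eps=0.05, iters=50):
    a = torch.nn.functional.normalize(audio, dim=-1)     # (Na, d)
    v = torch.nn.functional.normalize(visual, dim=-1)    # (Nv, d)
    cost = 1.0 - a @ v.T                                 # cosine distance
    K = torch.exp(-cost / eps)                           # Gibbs kernel
    mu = torch.full((a.size(0),), 1.0 / a.size(0))       # uniform marginals
    nu = torch.full((v.size(0),), 1.0 / v.size(0))
    u = torch.ones_like(mu)
    for _ in range(iters):
        u = mu / (K @ (nu / (K.T @ u)))                  # merged Sinkhorn update
    plan = u[:, None] * K * (nu / (K.T @ u))[None, :]
    return plan                                          # rows sum to 1/Na

plan = sinkhorn_plan(torch.randn(8, 256), torch.randn(32, 256))
print(plan.shape, plan.sum().item())                     # total mass ~= 1
```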
https://arxiv.org/abs/2501.09291
Zero-shot recognition models require extensive training data for generalization. However, in zero-shot 3D classification, collecting 3D data and captions is costly and labor-intensive, posing a significant barrier compared to 2D vision. Recent advances in generative models have achieved unprecedented realism in synthetic data production, and recent research shows the potential for using generated data as training data. This naturally raises the question: can synthetic 3D data generated by generative models be used to expand limited 3D datasets? In response, we present a synthetic 3D dataset expansion method, Text-guided Geometric Augmentation (TeGA). TeGA is tailored for language-image-3D pretraining, which achieves SoTA in zero-shot 3D classification, and uses a generative text-to-3D model to enhance and extend limited 3D datasets. Specifically, we automatically generate text-guided synthetic 3D data and introduce a consistency filtering strategy to discard noisy samples whose semantics and geometric shapes do not match the text. In an experiment that doubles the original dataset size using TeGA, our approach demonstrates improvements over the baselines, achieving zero-shot performance gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40. These results demonstrate that TeGA effectively bridges the 3D data gap, enabling robust zero-shot 3D classification even with limited real training data and paving the way for zero-shot 3D vision applications.
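The consistency-filtering step could look roughly like the sketch below, which keeps a generated sample only if its shape embedding is close enough to the prompt embedding; the encoders and threshold are placeholders rather than TeGA's actual configuration.

```python
import numpy as np

# Sketch of consistency filtering: keep a generated 3D sample only if its
# (rendered or point-cloud) embedding is sufficiently similar to the prompt
# embedding. Embeddings and threshold are placeholders, not TeGA's setup.

def consistency_filter(samples, threshold=0.25):
    """samples: list of (text_emb, shape_emb) pairs, unit-normalized vectors."""
    kept = []
    for text_emb, shape_emb in samples:
        if float(text_emb @ shape_emb) >= threshold:   # cosine similarity
            kept.append((text_emb, shape_emb))
    return kept

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x)
samples = [(unit(rng.normal(size=512)), unit(rng.normal(size=512)))
           for _ in range(100)]
print(f"kept {len(consistency_filter(samples))} / {len(samples)} samples")
```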
https://arxiv.org/abs/2501.09278
Large-scale text-to-image (T2I) diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from natural language text captions. Multiple layout-to-image models have been developed to control the generation process by utilizing a broad array of layouts such as segmentation maps, edges, and human keypoints. In this work, we present ObjectDiffusion, a model that takes inspiration from top cutting-edge image generative frameworks to seamlessly condition T2I models with new bounding-box capabilities. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pretrained parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP$_{50}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets in all three metrics. ObjectDiffusion demonstrates a distinctive capability to synthesize diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities in closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple objects of different sizes and locations.
https://arxiv.org/abs/2501.09194
Image captioning has become an essential Vision & Language research task. It is about predicting the most accurate caption for a given image or video. The research community has achieved impressive results by continuously proposing new models and approaches to improve overall performance. Nevertheless, despite the growing number of proposals, the performance metrics used to measure their advances have remained practically untouched over the years. As evidence of this, metrics like BLEU, METEOR, CIDEr, and ROUGE are still widely used today, alongside more sophisticated metrics such as BertScore and ClipScore. Hence, it is essential to adjust how we measure the advances, limitations, and scope of new image captioning proposals, and to adapt new metrics to these advanced approaches. This work proposes a new evaluation metric for the image captioning problem. To do so, we first generated a human-labeled dataset that assesses the degree to which captions correlate with the image's content. Taking these human scores as ground truth, we propose a new metric and compare it with several well-known metrics, from classical to more recent ones. Our metric outperforms them, and we present and discuss interesting insights.
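Validating a metric against such human labels usually comes down to correlating per-caption metric scores with human ratings, as in the small sketch below (all scores here are made up).

```python
from scipy.stats import kendalltau, pearsonr

# Sketch of how a caption metric is typically validated against human
# judgments: correlate per-caption metric scores with human ratings.
# The numbers below are invented for illustration.

human_scores  = [4.5, 2.0, 3.5, 1.0, 5.0, 2.5]    # e.g., mean annotator ratings
metric_scores = [0.82, 0.35, 0.60, 0.20, 0.90, 0.48]

print("Pearson r:", pearsonr(human_scores, metric_scores)[0])
print("Kendall tau:", kendalltau(human_scores, metric_scores)[0])
```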
https://arxiv.org/abs/2501.09155
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it is comparable to, or even exceeds, state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at this https URL.
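A training-free adapter in this spirit might blend zero-shot prompt logits with similarities to a few-shot cache of image features and of their textual descriptions, as sketched below; the weights and combination rule are illustrative assumptions, not IDEA's published formulation.

```python
import torch

# Training-free sketch in the spirit of a CLIP adapter: combine zero-shot
# text-prompt logits with similarities to a few-shot cache of image features
# and of their textual descriptions. Weights and the combination rule are
# illustrative assumptions.

def training_free_logits(q, class_text, cache_img, cache_desc, cache_labels,
                         n_classes, alpha=1.0, beta=1.0):
    """q: (d,) unit query image feature; all features unit-normalized.
    cache_*: (K, d) few-shot features; cache_labels: (K,) class ids."""
    zero_shot = q @ class_text.T                                      # (C,)
    affinity = alpha * (q @ cache_img.T) + beta * (q @ cache_desc.T)  # (K,)
    one_hot = torch.nn.functional.one_hot(cache_labels, n_classes).float()
    cache_logits = affinity @ one_hot                                 # (C,)
    return zero_shot + cache_logits

d, C, K = 512, 10, 40
norm = lambda x: torch.nn.functional.normalize(x, dim=-1)
logits = training_free_logits(
    q=norm(torch.randn(d)), class_text=norm(torch.randn(C, d)),
    cache_img=norm(torch.randn(K, d)), cache_desc=norm(torch.randn(K, d)),
    cache_labels=torch.randint(0, C, (K,)), n_classes=C)
print(logits.argmax().item())
```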
https://arxiv.org/abs/2501.08816
Multi-modal explanation involves assessing the veracity of a variety of content and relies on multiple information modalities to comprehensively consider the relevance and consistency between modalities. Most existing fake news video detection methods focus on improving accuracy while ignoring the importance of providing explanations. In this paper, we propose a novel problem, Fake News Video Explanation (FNVE): given multimodal news containing both video and caption text, we aim to generate natural language explanations that reveal the truth of the predictions. To this end, we develop FakeNVE, a new dataset of explanations for the veracity of multimodal posts, where each explanation is a natural language (English) sentence describing the attribution of a news thread. We benchmark FakeNVE using a multimodal transformer-based architecture, with a BART-based autoregressive decoder as the generator. Empirical evaluation shows compelling results for various baselines (applicable to FNVE) across multiple evaluation metrics. We also perform human evaluation on explanation generation, achieving high scores for both adequacy and fluency.
https://arxiv.org/abs/2501.08514
Remote sensing imagery is dense with objects and contextual visual information. There is a recent trend to combine paired satellite images and text captions for pretraining performant encoders for downstream tasks. However, while contrastive image-text methods like CLIP enable vision-language alignment and zero-shot classification ability, vision-only downstream performance tends to degrade compared to image-only pretraining such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a SkyCLIP baseline on vision-only tasks such as KNN classification and semantic segmentation (+6% mIoU on SpaceNet1), while retaining the ability to perform zero-shot classification, unlike MAE-pretrained methods.
https://arxiv.org/abs/2501.08490
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
https://arxiv.org/abs/2501.08326
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through a sequential coarse-to-fine process. We also introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data, and benchmark will be released at this https URL.
https://arxiv.org/abs/2501.08282
This article introduces a benchmark designed to evaluate the capabilities of multimodal models in analyzing and interpreting images. The benchmark focuses on seven key visual aspects: main object, additional objects, background, detail, dominant colors, style, and viewpoint. A dataset of 14,580 images, generated from diverse text prompts, was used to assess the performance of seven leading multimodal models. These models were evaluated on their ability to accurately identify and describe each visual aspect, providing insights into their strengths and weaknesses for comprehensive image understanding. The findings of this benchmark have significant implications for the development and selection of multimodal models for various image analysis tasks.
https://arxiv.org/abs/2501.08170
Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to a multi-stage fusion strategy, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To solve these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Compared to typical models that require multi-stage fusion in the transformer encoder and fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.
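One plausible reading of a cosine similarity-based fusion of bi-temporal features is sketched below: measure per-token similarity across the two acquisitions and emphasize the dissimilar (changed) positions. This is an illustration, not SAT-Cap's exact module.

```python
import torch

# Sketch of a cosine-similarity-based fusion of bi-temporal features: tokens
# that look alike across the two dates get down-weighted, changed tokens get
# emphasized. Illustrative reading only.

def cosine_change_fusion(feat_t1, feat_t2):
    """feat_t1, feat_t2: (N, d) token features from the two acquisitions."""
    sim = torch.nn.functional.cosine_similarity(feat_t1, feat_t2, dim=-1)  # (N,)
    change_weight = (1.0 - sim).unsqueeze(-1)        # high where content changed
    diff = feat_t2 - feat_t1
    return torch.cat([feat_t1, feat_t2, change_weight * diff], dim=-1)

fused = cosine_change_fusion(torch.randn(196, 256), torch.randn(196, 256))
print(fused.shape)   # (196, 768)
```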
https://arxiv.org/abs/2501.08114
Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning. The dataset comprises 5,033 manually annotated high-quality video clips, containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character's face. This model demonstrates superior performance in tracking faces and focusing on the facial expressions of the main characters, even in intricate multi-person scenarios. Additionally, we introduce a novel evaluation metric combining event extraction, relation classification, and the longest common subsequence (LCS) algorithm to assess the content consistency and temporal sequence consistency of generated text. Moreover, we present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task. All data and source code will be made publicly available.
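The LCS component of such a metric is the classic dynamic program below, applied here to ordered event lists; the normalization by reference length is an assumption for illustration, and event extraction itself is not shown.

```python
# Classic longest-common-subsequence DP, the building block such a metric can
# use to compare the ordered events extracted from a generated caption with
# the reference events. Event extraction and relation classification are not
# shown; normalizing by the reference length is an illustrative choice.

def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

ref = ["raise eyebrows", "smile", "look away"]
gen = ["smile", "look away", "frown"]
print(lcs_length(ref, gen) / len(ref))   # temporal-order consistency in [0, 1]
```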
https://arxiv.org/abs/2501.07978
The target of video moment retrieval (VMR) is predicting temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) overly rely on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle the aforementioned challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply VideoChatGPT and a span scorer to select the most appropriate spans. Our proposed method substantially outperforms the state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.
https://arxiv.org/abs/2501.07972
We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8\% over GPT-4o and 5.8\% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6\% performance advantage over GPT-4o and +24.9\% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.
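The DPO stage presumably optimizes the standard preference objective on (chosen, rejected) description pairs; the sketch below implements that published formulation generically, without reproducing Tarsier2's preference-data construction or hyperparameters.

```python
import torch
import torch.nn.functional as F

# Standard DPO objective on (chosen, rejected) response pairs, as in the
# original DPO paper; Tarsier2's exact preference data and hyperparameters
# are not reproduced here.

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """All inputs are summed token log-probs of the full responses."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss.item())
```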
https://arxiv.org/abs/2501.07888
The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, they struggle to handle long-range dependencies due to quadratic computational cost, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: a Temporal Mamba Block for sequential video processing and a Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on this, we develop the Multi-scale Temporal Encoder, aimed at enhancing the learning of visual features across scales, facilitating the perception of intra- and inter-frame information. To perform multi-modal fusion, we propose the Modality Aggregation Decoder, leveraging the Vision-to-Audio Fusion Block to integrate visual features into audio features across both frame and temporal levels. Further, we adopt the Contextual Integration Pyramid to perform audio-to-vision spatial-temporal context collaboration. Through these innovative contributions, our approach achieves new state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets. Our source code and model weights are available at AVS-Mamba.
https://arxiv.org/abs/2501.07810