There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short, natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, scoring mostly at random chance. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.
https://arxiv.org/abs/2410.02763
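To make the "text score" and "video score" above concrete: each Vinoground example pairs two videos with two captions that use the same words in a different temporal order, and the scores presumably follow the Winoground-style convention of requiring the correct pairing to win in both directions. A minimal scoring sketch under that assumption (the matching scores `s` would come from whatever LMM or CLIP-style model is being evaluated):

```python
import numpy as np

def vinoground_style_scores(s):
    """Compute Winoground-style text / video / group scores.

    s has shape (N, 2, 2): s[i, v, c] is the model's matching score for
    video v and caption c of counterfactual pair i, with (v=0, c=0) and
    (v=1, c=1) as the ground-truth pairings. This mirrors the text/video
    scores named in the abstract, assuming Vinoground scores pairs the
    same way Winoground does.
    """
    s = np.asarray(s, dtype=float)
    # Text score: for each video, the correct caption must beat the wrong one.
    text_ok = (s[:, 0, 0] > s[:, 0, 1]) & (s[:, 1, 1] > s[:, 1, 0])
    # Video score: for each caption, the correct video must beat the wrong one.
    video_ok = (s[:, 0, 0] > s[:, 1, 0]) & (s[:, 1, 1] > s[:, 0, 1])
    # Group score: both conditions hold simultaneously.
    group_ok = text_ok & video_ok
    return text_ok.mean(), video_ok.mean(), group_ok.mean()

# Example with two hypothetical counterfactual pairs.
scores = [[[0.9, 0.2], [0.1, 0.8]],   # model gets this pair right
          [[0.4, 0.6], [0.5, 0.3]]]   # and this one wrong
print(vinoground_style_scores(scores))  # (0.5, 0.5, 0.5)
```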
We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than on hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
https://arxiv.org/abs/2410.02762
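A minimal sketch of the knowledge-erasure step described above: image features are linearly orthogonalized against the embedding direction of a hallucinated object, removing the component that drives the hallucination. The feature shapes and the way the hallucinated direction is obtained (e.g. from the unembedding row of the object token) are illustrative assumptions.

```python
import torch

def erase_hallucinated_object(image_feats, halluc_dir):
    """Remove the component of image features that aligns with a
    hallucinated object's embedding direction (linear orthogonalization).

    image_feats: (num_patches, d) latent image representations.
    halluc_dir:  (d,) embedding of the hallucinated object, e.g. taken from
                 the language-model head row for that object token
                 (an assumption for this sketch).
    """
    u = halluc_dir / halluc_dir.norm()            # unit direction to erase
    coeff = image_feats @ u                       # (num_patches,) projections onto u
    return image_feats - coeff.unsqueeze(-1) * u  # subtract that component

# Tiny usage example with random tensors.
feats = torch.randn(576, 4096)
direction = torch.randn(4096)
edited = erase_hallucinated_object(feats, direction)
print((edited @ (direction / direction.norm())).abs().max())  # ~0 up to float error
```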
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is required by MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
https://arxiv.org/abs/2410.02746
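The region-text contrastive idea can be pictured as: pool the "promptable" patch embeddings inside a box prompt (the spatial hint) into a region embedding, then contrast it against the embedding of the region's caption. The sketch below is an illustrative stand-in, not CLOC's actual modules; the box pooling, dimensions, and loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def box_pool(patch_embeds, boxes, grid_size):
    """Average patch embeddings inside each box, a simple stand-in for the
    module that turns promptable image embeddings into region features.

    patch_embeds: (B, grid_size*grid_size, d); boxes: (B, 4) normalized x1, y1, x2, y2."""
    B, _, d = patch_embeds.shape
    feats = patch_embeds.view(B, grid_size, grid_size, d)
    regions = []
    for b in range(B):
        x1, y1, x2, y2 = (boxes[b] * grid_size).long().clamp(0, grid_size - 1).tolist()
        regions.append(feats[b, y1:y2 + 1, x1:x2 + 1].mean(dim=(0, 1)))
    return torch.stack(regions)                      # (B, d)

def region_text_contrastive_loss(region_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE between region embeddings and region-caption embeddings."""
    r = F.normalize(region_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = r @ t.T / temperature
    labels = torch.arange(len(r))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

patches = torch.randn(8, 24 * 24, 512)               # promptable image embeddings
xy1 = torch.rand(8, 2) * 0.5                         # top-left corners of box prompts
boxes = torch.cat([xy1, xy1 + 0.4], dim=-1)          # (x1, y1, x2, y2) in [0, 1]
region_captions = torch.randn(8, 512)                # embeddings of region texts
loss = region_text_contrastive_loss(box_pool(patches, boxes, 24), region_captions)
```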
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining caption formats ranging from Short Synthetic Captions (SSC) to Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
https://arxiv.org/abs/2410.02740
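The "hybrid approach that keeps both synthetic captions and AltTexts" amounts to a per-sample choice of caption source during pre-training. A toy sketch; the field names and the 50/50 mixing ratio are placeholders, not the paper's recipe:

```python
import random

def pick_caption(example, p_synthetic=0.5, synthetic_key="dsc_plus"):
    """For each image, keep the original AltText with probability 1 - p_synthetic,
    otherwise use a rewritten synthetic caption (e.g. SSC or DSC+).

    `example` is assumed to look like {"alt_text": ..., "ssc": ..., "dsc_plus": ...};
    the keys and the mixing ratio are illustrative, not the paper's setting."""
    if random.random() < p_synthetic and example.get(synthetic_key):
        return example[synthetic_key]
    return example["alt_text"]

sample = {"alt_text": "img_1234.jpg photo",
          "ssc": "A dog running on a beach.",
          "dsc_plus": "A golden retriever sprints along a sandy beach at sunset, ..."}
print(pick_caption(sample))
```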
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
https://arxiv.org/abs/2410.02713
Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer four granularity texts in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results and toolkit will be released gradually on this http URL.
https://arxiv.org/abs/2410.02492
Modeling temporal characteristics plays a significant role in the representation learning of audio waveform. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent performance improvement in retrieval accuracy compared with baselines. We also show the pretrained CoLLAP models can be transferred to various music information retrieval tasks, with heterogeneous long-form multimodal contexts.
https://arxiv.org/abs/2410.02271
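A rough sketch of the clip-level fusion described above: the song is segmented into clips, each clip embedding is scored against the caption embedding, and an attention weighting over clips produces the final fusion score used for contrastive alignment. The dimensions and the simple dot-product attention are assumptions for illustration, not CoLLAP's exact architecture.

```python
import torch
import torch.nn.functional as F

def fused_similarity(clip_embeds, text_embed, temperature=0.1):
    """Attention-weighted song-text similarity.

    clip_embeds: (num_clips, d) embeddings of the song's segments.
    text_embed:  (d,) embedding of the long-form caption.
    Clips that match the caption better receive larger attention weights,
    and the fusion score is the weighted sum of per-clip similarities."""
    clips = F.normalize(clip_embeds, dim=-1)
    text = F.normalize(text_embed, dim=-1)
    sims = clips @ text                               # (num_clips,) per-clip similarity
    attn = torch.softmax(sims / temperature, dim=0)   # weigh clips by relevance
    return (attn * sims).sum()                        # scalar fusion score

song = torch.randn(36, 768)    # e.g. a 288-second track split into 8-second clips
caption = torch.randn(768)
print(fused_similarity(song, caption))
```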
Recent years have seen many audio-domain text-to-music generation models that rely on large amounts of text-audio pairs for training. However, symbolic-domain controllable music generation has lagged behind partly due to the lack of a large-scale symbolic music dataset with extensive metadata and captions. In this work, we present MetaScore, a new dataset consisting of 963K musical scores paired with rich metadata, including free-form user-annotated tags, collected from an online music forum. To approach text-to-music generation, we leverage a pretrained large language model (LLM) to generate pseudo natural language captions from the metadata. With the LLM-enhanced MetaScore, we train a text-conditioned music generation model that learns to generate symbolic music from the pseudo captions, allowing control of instruments, genre, composer, complexity and other free-form music descriptors. In addition, we train a tag-conditioned system that supports a predefined set of tags available in MetaScore. Our experimental results show that both the proposed text-to-music and tags-to-music models outperform a baseline text-to-music model in a listening test, while the text-based system offers a more natural interface that allows free-form natural language prompts.
https://arxiv.org/abs/2410.02084
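One way to picture the LLM captioning step: pack the metadata fields and free-form tags into a prompt and let a pretrained LLM write the pseudo caption that later conditions the generator. The field names and prompt wording below are illustrative assumptions:

```python
def metadata_to_prompt(meta):
    """Build an LLM prompt that turns MetaScore-style metadata into a
    pseudo natural-language caption (field names are illustrative)."""
    parts = []
    if meta.get("composer"):
        parts.append(f"composer: {meta['composer']}")
    if meta.get("genre"):
        parts.append(f"genre: {meta['genre']}")
    if meta.get("instruments"):
        parts.append("instruments: " + ", ".join(meta["instruments"]))
    if meta.get("complexity"):
        parts.append(f"complexity: {meta['complexity']}")
    if meta.get("tags"):
        parts.append("user tags: " + ", ".join(meta["tags"]))
    return ("Write one fluent sentence describing a piece of music with "
            "the following attributes:\n" + "\n".join(parts))

meta = {"composer": "anonymous", "genre": "waltz",
        "instruments": ["piano", "violin"], "complexity": "intermediate",
        "tags": ["romantic", "ballroom"]}
print(metadata_to_prompt(meta))
# The returned string would be sent to a pretrained LLM; its completion
# serves as the pseudo caption used to condition the music generator.
```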
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small-scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
https://arxiv.org/abs/2410.02056
There is a scarcity of multilingual vision-language models that properly account for the perceptual differences that are reflected in image captions across languages and cultures. In this work, through a multimodal, multilingual retrieval case study, we quantify the existing lack of model flexibility. We empirically show performance gaps between training on captions that come from native German perception and captions that have been either machine-translated or human-translated from English into German. To address these gaps, we further propose and evaluate caption augmentation strategies. While we achieve mean recall improvements (+1.3), gaps still remain, indicating an open area of future work for the community.
https://arxiv.org/abs/2410.02027
Surrogate models are used to predict the behavior of complex energy systems that are too expensive to simulate with traditional numerical methods. Our work introduces the use of language descriptions, which we call "system captions" or SysCaps, to interface with such surrogates. We argue that interacting with surrogates through text, particularly natural language, makes these models more accessible for both experts and non-experts. We introduce a lightweight multimodal text and timeseries regression model and a training pipeline that uses large language models (LLMs) to synthesize high-quality captions from simulation metadata. Our experiments on two real-world simulators of buildings and wind farms show that our SysCaps-augmented surrogates have better accuracy on held-out systems than traditional methods while enjoying new generalization abilities, such as handling semantically related descriptions of the same test system. Additional experiments also highlight the potential of SysCaps to unlock language-driven design space exploration and to regularize training through prompt augmentation.
https://arxiv.org/abs/2405.19653
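A minimal sketch of what a "lightweight multimodal text and timeseries regression model" could look like: a small caption encoder and a small timeseries encoder are fused and regressed to the simulator output. The architecture, dimensions, and input shapes are assumptions for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

class TextTimeseriesRegressor(nn.Module):
    """Toy surrogate: embeds a tokenized system caption, encodes an input
    timeseries (e.g. weather drivers), and regresses the simulator output."""
    def __init__(self, vocab_size=8000, d=128, ts_channels=4):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d)
        self.ts_encoder = nn.GRU(ts_channels, d, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, caption_ids, timeseries):
        text_feat = self.token_embed(caption_ids).mean(dim=1)  # (B, d) mean-pooled caption
        _, h = self.ts_encoder(timeseries)                     # final hidden state (1, B, d)
        fused = torch.cat([text_feat, h.squeeze(0)], dim=-1)   # (B, 2d)
        return self.head(fused).squeeze(-1)                    # (B,) predicted target

model = TextTimeseriesRegressor()
caption_ids = torch.randint(0, 8000, (2, 32))      # two tokenized SysCaps
series = torch.randn(2, 168, 4)                    # e.g. one week of hourly drivers
print(model(caption_ids, series).shape)            # torch.Size([2])
```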
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate their robustness to variations in specific visual factors. Second, we assess two critical safety objectives--confidence uncertainty and out-of-distribution detection--beyond mere classification accuracy. Third, we evaluate the finesse with which CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between vision and language encoders within modern large multimodal models (LMMs) that utilize CLIP as the visual backbone, focusing on how this interaction impacts classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder in CLIP plays a significant role in their robustness against 3D corruption. CLIP models tend to exhibit a bias towards shape when making predictions. Moreover, this bias tends to diminish after fine-tuning on ImageNet. Vision-language models like LLaVA, leveraging the CLIP vision encoder, could exhibit benefits in classification performance for challenging categories over CLIP alone. Our findings are poised to offer valuable guidance for enhancing the robustness and reliability of CLIP models.
https://arxiv.org/abs/2410.01534
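"Test-time prompts" is one of the six factors examined; the usual setup is zero-shot classification with prompt-template ensembling. A generic sketch using the OpenAI CLIP package (the templates, class names, and "example.jpg" path are placeholders, and this is not the paper's evaluation harness):

```python
import torch
import clip  # OpenAI CLIP package
from PIL import Image

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]
classnames = ["dog", "cat", "car"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Average text embeddings over prompt templates to get one prototype per class.
    prototypes = []
    for name in classnames:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        prototypes.append(emb.mean(dim=0))
    prototypes = torch.stack(prototypes)
    prototypes = prototypes / prototypes.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ prototypes.T).softmax(dim=-1)

print(dict(zip(classnames, probs.squeeze(0).tolist())))
```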
The task of action spotting consists in both identifying actions and precisely localizing them in time with a single timestamp in long, untrimmed video streams. Automatically extracting those actions is crucial for many sports applications, including sports analytics to produce extended statistics on game actions, coaching to provide support to video analysts, or fan engagement to automatically overlay content in the broadcast when specific actions occur. However, before 2018, no large-scale datasets for action spotting in sports were publicly available, which impeded benchmarking action spotting methods. In response, our team built the largest dataset and the most comprehensive benchmarks for sports video understanding, under the umbrella of SoccerNet. Particularly, our dataset contains a subset specifically dedicated to action spotting, called SoccerNet Action Spotting, containing more than 550 complete broadcast games annotated with almost all types of actions that can occur in a football game. This dataset is tailored to develop methods for automatic spotting of actions of interest, including deep learning approaches, by providing a large amount of manually annotated actions. To engage with the scientific community, the SoccerNet initiative organizes yearly challenges, during which participants from all around the world compete to achieve state-of-the-art performances. Thanks to our dataset and challenges, more than 60 methods were developed or published over the past five years, improving on the first baselines and making action spotting a viable option for the sports industry. This paper traces the history of action spotting in sports, from the creation of the task back in 2018, to the role it plays today in research and the sports industry.
https://arxiv.org/abs/2410.01304
The emergence of Vision-Language Models (VLMs) represents a significant advancement in integrating computer vision with Large Language Models (LLMs) to generate detailed text descriptions from visual inputs. Despite their growing importance, the security of VLMs, particularly against backdoor attacks, is underexplored. Moreover, prior works often assume attackers have access to the original training data, which is often unrealistic. In this paper, we address a more practical and challenging scenario where attackers must rely solely on Out-Of-Distribution (OOD) data. We introduce VLOOD (Backdooring Vision-Language Models with Out-of-Distribution Data), a novel approach with two key contributions: (1) demonstrating backdoor attacks on VLMs in complex image-to-text tasks while minimizing degradation of the original semantics under poisoned inputs, and (2) proposing innovative techniques for backdoor injection without requiring any access to the original training data. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of VLOOD, revealing a critical security vulnerability in VLMs and laying the foundation for future research on securing multimodal models against sophisticated threats.
https://arxiv.org/abs/2410.01264
Localizing unusual activities, such as human errors or surveillance incidents, in videos holds practical significance. However, current video understanding models struggle with localizing these unusual events likely because of their insufficient representation in models' pretraining datasets. To explore foundation models' capability in localizing unusual activity, we introduce UAL-Bench, a comprehensive benchmark for unusual activity localization, featuring three video datasets: UAG-OOPS, UAG-SSBD, UAG-FunQA, and an instruction-tuning dataset, OOPS-UAG-Instruct, to improve model capabilities. UAL-Bench evaluates three approaches: Video-Language Models (Vid-LLMs), instruction-tuned Vid-LLMs, and a novel integration of Vision-Language Models and Large Language Models (VLM-LLM). Our results show the VLM-LLM approach excels in localizing short-span unusual events and predicting their onset (start time) more accurately than Vid-LLMs. We also propose a new metric, R@1, TD <= p, to address limitations in existing evaluation methods. Our findings highlight the challenges posed by long-duration videos, particularly in autism diagnosis scenarios, and the need for further advancements in localization techniques. Our work not only provides a benchmark for unusual activity localization but also outlines the key challenges for existing foundation models, suggesting future research directions on this important task.
https://arxiv.org/abs/2410.01180
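One plausible reading of the proposed "R@1, TD <= p" metric: the top-1 predicted start time counts as a hit only if its temporal distance to the annotated onset is at most p seconds. The exact benchmark definition may differ (e.g. it may also consider end times); the sketch below is illustrative.

```python
def recall_at_1_td(predictions, ground_truths, p=1.0):
    """R@1 with a temporal-distance threshold: the top-1 predicted start time
    must fall within p seconds of the annotated start time.

    predictions / ground_truths: lists of start times in seconds, aligned by
    video. This is one plausible reading of 'R@1, TD <= p', not the
    benchmark's official implementation."""
    hits = sum(abs(pred - gt) <= p for pred, gt in zip(predictions, ground_truths))
    return hits / len(ground_truths)

preds = [3.2, 10.5, 47.0]    # predicted onsets of the unusual activity
gts = [3.0, 14.0, 46.4]      # annotated onsets
print(recall_at_1_td(preds, gts, p=1.0))   # 2 of 3 videos are hits
```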
Accurately identifying, understanding, and describing driving safety-critical events (SCEs), including crashes and near-crashes, is crucial for traffic safety, automated driving systems, and advanced driver assistance systems research and application. As SCEs are rare events, most general Vision-Language Models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucination and missing key safety characteristics. To tackle these challenges, we propose ScVLM, a hybrid approach that combines supervised learning and contrastive learning to improve driving video understanding and event description rationality for VLMs. The proposed approach is trained and evaluated on more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and in mitigating hallucinations from VLMs.
https://arxiv.org/abs/2410.00982
In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment. Project page: this https URL
https://arxiv.org/abs/2410.00905
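The negatives are described as "mixed-type negative captions derived from positive ones". The rule-based generators below are only a rough illustration of that idea (word-order and word-replacement negatives); the paper's actual generation method and its positive/negative balancing are not shown here.

```python
import random

REPLACEMENTS = {"dog": "cat", "red": "blue", "left": "right", "above": "below"}

def shuffle_negative(caption, rng):
    """Negative by swapping two words, breaking word order / relations."""
    words = caption.split()
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def replace_negative(caption):
    """Negative by replacing objects/attributes with plausible alternatives."""
    return " ".join(REPLACEMENTS.get(w, w) for w in caption.split())

rng = random.Random(0)
positive = "a red dog sits to the left of a chair"
negatives = [shuffle_negative(positive, rng), replace_negative(positive)]
print(negatives)
```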
This paper introduces an innovative approach to the Medical Vision-Language Pre-training (Med-VLP) area in the specialized context of radiograph representation learning. While conventional methods frequently merge textual annotations into unified reports, we acknowledge the intrinsic hierarchical relationship between the findings and impression sections in radiograph datasets. To establish a targeted correspondence between images and texts, we propose a novel HybridMED framework to align global-level visual representations with the impression and token-level visual representations with the findings. Moreover, our framework incorporates a generation decoder that employs two proxy tasks, responsible for generating the impression from (1) images, via a captioning branch, and (2) findings, through a summarization branch. Additionally, knowledge distillation is leveraged to facilitate the training process. Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements due to the shared self-attention and feed-forward architecture.
https://arxiv.org/abs/2410.00448
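A hedged sketch of the two-level alignment described above: the global image embedding is contrasted with the impression embedding, while token-level (patch) embeddings are pooled and contrasted with the findings embedding. The pooling choice and dimensions are illustrative assumptions, not HybridMED's actual design.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    labels = torch.arange(len(a))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def hybrid_alignment_loss(global_img, patch_img, impression_txt, findings_txt):
    """Global-level visual embedding <-> impression; token-level visual
    embeddings <-> findings (patches are max-pooled here as a simple
    stand-in for a finer-grained token alignment)."""
    global_loss = info_nce(global_img, impression_txt)
    token_level = patch_img.max(dim=1).values          # (B, d) pooled patch features
    findings_loss = info_nce(token_level, findings_txt)
    return global_loss + findings_loss

B, P, d = 4, 196, 512
loss = hybrid_alignment_loss(torch.randn(B, d), torch.randn(B, P, d),
                             torch.randn(B, d), torch.randn(B, d))
print(loss)
```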
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, Robust Instruction Generation (RIG) engine. RIG generates two key instruction data: 1) the Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model's discriminative understanding. 2) the Diverse Instruction-following data, which contains various instruction styles to enhance model's generalization. As a result, we construct 1 million instruction-following data, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates Relation-Augmented Projector to enhance spatial understanding, and then strengthens the object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
https://arxiv.org/abs/2410.00255
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.
https://arxiv.org/abs/2409.20566