Spatiotemporal action localization in chaotic scenes is a challenging task on the way toward advanced video understanding. High-quality video feature extraction and more precise detector-predicted anchors can effectively improve model performance. To this end, we propose SFMViT, a high-performance dual-stream spatiotemporal feature extraction network with an anchor pruning strategy. The backbone of SFMViT is composed of ViT and SlowFast with prior knowledge of spatiotemporal action localization, fully utilizing ViT's excellent global feature extraction capabilities and SlowFast's spatiotemporal sequence modeling capabilities. In addition, we introduce a confidence max-heap to prune the anchors detected in each frame and keep only the effective ones. These designs enable SFMViT to achieve an mAP of 26.62% on the Chaotic World dataset, far exceeding existing models. Code is available at this https URL.
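As a rough sketch of the anchor-pruning idea only (the anchor format, the per-frame cap k, and all names below are assumptions, not the paper's actual settings), a confidence max-heap can keep the top-k detections per frame:

```python
import heapq
from typing import List, Tuple

# Hypothetical anchor format: (confidence, (x1, y1, x2, y2)).
Anchor = Tuple[float, Tuple[float, float, float, float]]

def prune_anchors(anchors: List[Anchor], k: int = 10) -> List[Anchor]:
    """Keep the k most confident anchors of one frame.

    heapq.nlargest maintains a heap of size k internally, so pruning is
    O(n log k) instead of fully sorting every detected anchor.
    """
    return heapq.nlargest(k, anchors, key=lambda a: a[0])

frame_anchors = [(0.91, (10, 10, 50, 80)), (0.32, (5, 5, 20, 20)),
                 (0.77, (40, 30, 90, 120)), (0.12, (0, 0, 15, 15))]
print(prune_anchors(frame_anchors, k=2))  # keeps the two highest-confidence boxes
```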
https://arxiv.org/abs/2404.16609
The recent Local Implicit Image Function (LIIF) and subsequent Implicit Neural Representation (INR)-based works have achieved remarkable success in Arbitrary-Scale Super-Resolution (ASSR) by using MLP to decode Low-Resolution (LR) features. However, these continuous image representations typically implement decoding in High-Resolution (HR) High-Dimensional (HD) space, leading to a quadratic increase in computational cost and seriously hindering the practical applications of ASSR. To tackle this problem, we propose a novel Latent Modulated Function (LMF), which decouples the HR-HD decoding process into shared latent decoding in LR-HD space and independent rendering in HR Low-Dimensional (LD) space, thereby realizing the first computationally optimal paradigm of continuous image representation. Specifically, LMF utilizes an HD MLP in latent space to generate latent modulations of each LR feature vector. This enables a modulated LD MLP in render space to quickly adapt to any input feature vector and perform rendering at arbitrary resolution. Furthermore, we leverage the positive correlation between modulation intensity and input image complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm, offering the flexibility to adjust the decoding efficiency based on the rendering precision. Extensive experiments demonstrate that converting existing INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9%, accelerate inference by up to 57 times, and save up to 76% of parameters, while maintaining competitive performance. The code is available at this https URL.
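A minimal NumPy sketch of the decoupled decoding idea, under assumed layer sizes and a simple gating form of modulation (not the paper's exact design): the heavy high-dimensional MLP runs once per LR feature vector, while the lightweight low-dimensional render MLP runs once per HR pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Shared high-dimensional MLP in latent (LR) space: run once per LR feature vector.
W1, W2 = rng.normal(size=(64, 256)), rng.normal(size=(256, 32))
def latent_modulation(lr_feature):            # (64,) LR feature -> (32,) modulation
    return relu(lr_feature @ W1) @ W2

# Lightweight low-dimensional MLP in render space: run once per HR query pixel.
V1, V2 = rng.normal(size=(2, 32)), rng.normal(size=(32, 3))
def render(coord, modulation):                # (2,) coordinate -> (3,) RGB
    hidden = relu(coord @ V1) * modulation    # modulation gates the hidden units
    return hidden @ V2

lr_feature = rng.normal(size=64)
mod = latent_modulation(lr_feature)                      # heavy MLP: once per LR cell
hr_coords = rng.uniform(-1.0, 1.0, size=(16, 2))         # arbitrary-resolution queries
pixels = np.stack([render(c, mod) for c in hr_coords])   # cheap MLP: once per pixel
print(pixels.shape)                                      # (16, 3)
```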
https://arxiv.org/abs/2404.16451
Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and an assessment of LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. VISLA enables a rigorous evaluation, shedding light on language models' capabilities in handling semantic and lexical nuances. Data and code will be made available at this https URL.
https://arxiv.org/abs/2404.16365
In this paper, we present OmniSearchSage, a versatile and scalable system for understanding search queries, pins, and products for Pinterest search. We jointly learn a unified query embedding coupled with pin and product embeddings, leading to an improvement of >8% relevance, >7% engagement, and >5% ads CTR in Pinterest's production search system. The main contributors to these gains are improved content understanding, better multi-task learning, and real-time serving. We enrich our entity representations using diverse text derived from image captions from a generative LLM, historical engagement, and user-curated boards. Our multitask learning setup produces a single search query embedding in the same space as pin and product embeddings and compatible with pre-existing pin and product embeddings. We show the value of each feature through ablation studies, and show the effectiveness of a unified model compared to standalone counterparts. Finally, we share how these embeddings have been deployed across the Pinterest search stack, from retrieval to ranking, scaling to serve 300k requests per second at low latency. Our implementation of this work is available at this https URL.
https://arxiv.org/abs/2404.16260
Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.
https://arxiv.org/abs/2404.16123
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models of OpenAI CLIP and OpenCLIP on zero-shot image classification but with less (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
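A rough sketch of the described inference-time ensembling, with the task metadata simplified to averaged class-name embeddings and one cluster center per expert (both simplifications and all names are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_logits(expert_logits, class_name_embs, cluster_centers, tau=0.1):
    """expert_logits: (E, N, C) per-expert logits for N images and C classes.
    class_name_embs: (C, D) embeddings of the task metadata (class names).
    cluster_centers: (E, D) one center per data expert (simplified to one each).
    Returns ensemble logits of shape (N, C)."""
    # Correlation between task metadata and each expert's cluster condition.
    sim = class_name_embs.mean(axis=0) @ cluster_centers.T      # (E,)
    weights = softmax(sim / tau)                                 # (E,)
    return np.tensordot(weights, expert_logits, axes=(0, 0))    # (N, C)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 10))      # 4 experts, 8 images, 10 classes
names = rng.normal(size=(10, 16))         # class-name embeddings
centers = rng.normal(size=(4, 16))        # one center per expert
print(ensemble_logits(logits, names, centers).shape)   # (8, 10)
```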
https://arxiv.org/abs/2404.16030
Diffusion models have made significant advances in text-guided synthesis tasks. However, editing user-provided images remains challenging, as the high dimensional noise input space of diffusion models is not naturally suited for image inversion or spatial editing. In this work, we propose an image representation that promotes spatial editing of input images using a diffusion model. Concretely, we learn to encode an input into "image elements" that can faithfully reconstruct an input image. These elements can be intuitively edited by a user, and are decoded by a diffusion model into realistic images. We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition. Project page: this https URL
https://arxiv.org/abs/2404.16029
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of (reference image, text, target image). These specific triplets are not as commonly available as simple image-text pairs, limiting the widespread use of CIR and its scalability. On the other hand, zero-shot CIR can be relatively easily trained with image-caption pairs without considering the image-to-image relation, but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data and learn our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves the existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
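The pseudo-triplet mining can be sketched under simplifying assumptions: related reference/target pairs are found by embedding similarity within fixed thresholds, and `vdg` stands in for the trained Visual Delta Generator (the thresholds and the callable are placeholders, not the paper's procedure).

```python
import numpy as np

def mine_pseudo_triplets(image_embs, images, vdg, min_sim=0.7, max_sim=0.95):
    """Pair each image with a moderately similar neighbour from the auxiliary
    pool (same domain, but visibly different), then let the VDG describe the
    visual delta between the two. Embeddings are assumed L2-normalized; the
    similarity thresholds and the `vdg` callable are placeholders."""
    sims = image_embs @ image_embs.T
    triplets = []
    for i in range(len(images)):
        for j in np.argsort(-sims[i]):
            if i != j and min_sim < sims[i, j] < max_sim:
                delta_text = vdg(images[i], images[j])  # e.g. "make the dress red"
                triplets.append((images[i], delta_text, images[j]))
                break
    return triplets
```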
https://arxiv.org/abs/2404.15516
Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completions. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks and across 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.
https://arxiv.org/abs/2404.15500
Generating high fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case finetuning or missing the identity details in the video generation process. In this study, we present ID-Animator, a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones with a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline, which incorporates a decoupled human attribute and action captioning technique applied to a constructed facial image pool. Based on this pipeline, a random face reference training method is further devised to precisely capture the ID-relevant embeddings from reference images, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator over previous models in generating personalized human videos. Moreover, our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community backbone models, showing high extendability in real-world applications for video generation where identity preservation is highly desired. Our codes and checkpoints will be released at this https URL.
https://arxiv.org/abs/2404.15275
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames, and existing large-scale VAD research primarily focuses on road traffic and human activity scenes. In industrial scenes, there are often a variety of unpredictable anomalies, and the VAD method can play a significant role in these scenarios. However, there is a lack of applicable datasets and methods specifically tailored for industrial production scenarios due to concerns regarding privacy and security. To bridge this gap, we propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios. The industrial processes in our dataset are chosen through on-site factory research and discussions with engineers. This dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage. Moreover, we annotate the key feature of the industrial process, i.e., periodicity. Based on the proposed dataset, we introduce a period memory module and a sliding window inspection mechanism to effectively investigate the periodic information in a basic reconstruction model. Our framework leverages a LoRA adapter to explore the effective migration of pretrained models, which are initially trained using synthetic data, into real-world scenarios. Our proposed dataset and method will fill the gap in the field of industrial video anomaly detection and drive the process of video understanding tasks as well as smart factory deployment.
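A toy sketch of a sliding-window inspection over per-frame reconstruction errors, assuming a known period and plain mean aggregation (the paper's period memory module is not reproduced here; the function and values below are illustrative):

```python
import numpy as np

def sliding_window_scores(frame_errors, period, stride=1):
    """Aggregate per-frame reconstruction errors over windows of one period
    length; windows with unusually high mean error are anomaly candidates."""
    frame_errors = np.asarray(frame_errors, dtype=float)
    scores = []
    for start in range(0, len(frame_errors) - period + 1, stride):
        scores.append(frame_errors[start:start + period].mean())
    return np.asarray(scores)

errors = np.r_[np.full(40, 0.1), np.full(8, 0.9), np.full(40, 0.1)]  # toy error curve
scores = sliding_window_scores(errors, period=10)
print(scores.argmax())  # window covering the high-error burst scores highest
```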
https://arxiv.org/abs/2404.15033
Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite their impressive ability to perform complex reasoning, current VLMs often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate the models' performance on this new task, we propose a new evaluation metric named ITM-IoU for which our experiments show a high correlation with human evaluation. In addition, we also provide a comprehensive experimental analysis of existing mainstream VLMs, including fully supervised learning and in-context learning settings. We have found that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.
https://arxiv.org/abs/2404.14715
Evaluating the performance of Multi-modal Large Language Models (MLLMs), integrating both point cloud and language, presents significant challenges. The lack of a comprehensive assessment hampers determining whether these models truly represent advancements, thereby impeding further progress in the field. Current evaluations heavily rely on classification and caption tasks, falling short in providing a thorough assessment of MLLMs. A pressing need exists for a more sophisticated evaluation method capable of thoroughly analyzing the spatial understanding and expressive capabilities of these models. To address these issues, we introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset known as 3DBench, providing an extensible platform for a comprehensive evaluation of MLLMs. Specifically, we establish the benchmark that spans a wide range of spatial and semantic scales, from object-level to scene-level, addressing both perception and planning tasks. Furthermore, we present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total. Thorough experiments evaluating trending MLLMs, comparisons against existing datasets, and variations of training protocols demonstrate the superiority of 3DBench, offering valuable insights into current limitations and potential research directions.
https://arxiv.org/abs/2404.14678
In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is to use multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach results in reduced performance for individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experiment results prove that our method outperforms separate learning methods and naive multi-task learning methods. Data and code are released at this https URL.
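One plausible reading of the prompt-based reformulation, sketched with an assumed prompt template, assumed score bins, and a placeholder `embed_text` encoder (none of these are taken from the paper):

```python
import numpy as np

def score_by_matching(video_emb, embed_text, score_bins):
    """Cast score regression as video-text matching: build one prompt per
    candidate score, embed it, and return the score whose prompt is most
    similar to the video embedding. `embed_text` is a placeholder text encoder."""
    prompts = [f"The execution of this action deserves a score of about {s}."
               for s in score_bins]
    text_embs = np.stack([embed_text(p) for p in prompts])          # (B, D)
    sims = text_embs @ video_emb
    sims /= (np.linalg.norm(text_embs, axis=1) * np.linalg.norm(video_emb) + 1e-8)
    return score_bins[int(np.argmax(sims))]

# Toy usage with a fake encoder, only to show the call pattern:
rng = np.random.default_rng(0)
fake_encoder = lambda text: rng.normal(size=64)
print(score_by_matching(rng.normal(size=64), fake_encoder, [60, 70, 80, 90]))
```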
https://arxiv.org/abs/2404.14471
Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
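The contrastive alignment step can be illustrated with a standard symmetric InfoNCE objective between visual and pseudo-query features; this is a generic stand-in for illustration, not necessarily PIN's exact loss or hyperparameters.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(visual, pseudo_query, tau=0.07):
    """Symmetric contrastive loss: the i-th visual feature is pulled toward the
    i-th pseudo-query feature and pushed away from the other queries in the batch."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    q = pseudo_query / np.linalg.norm(pseudo_query, axis=1, keepdims=True)
    logits = v @ q.T / tau                              # (B, B) similarity matrix
    diag = np.arange(len(v))
    loss_v2q = -log_softmax(logits)[diag, diag].mean()
    loss_q2v = -log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_v2q + loss_q2v)

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```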
https://arxiv.org/abs/2404.13611
An effective method for combining frozen large language models (LLM) and visual encoders involves a resampler module that creates a "visual prompt" which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use diagnostic classifiers to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
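A minimal example of a diagnostic (probing) classifier in this spirit: a linear probe trained on frozen, pooled visual-prompt vectors to predict a spatial-relation label. The pooling, the binary label, and the random stand-in data are assumptions; only the probing pattern is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: frozen visual-prompt vectors produced by the resampler,
# paired with a spatial-relation label per image (e.g. "left-of" vs "above").
rng = np.random.default_rng(0)
visual_prompts = rng.normal(size=(1000, 256))   # pooled resampler outputs (assumed pooling)
labels = rng.integers(0, 2, size=1000)          # binary spatial-relation label

X_tr, X_te, y_tr, y_te = train_test_split(visual_prompts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Accuracy near chance suggests the frozen visual prompt does not encode the relation.
print("probe accuracy:", probe.score(X_te, y_te))
```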
https://arxiv.org/abs/2404.13594
Automatic movie narration aims to create video-aligned plot descriptions to assist visually impaired audiences. It differs from standard video captioning in that it requires not only describing key visual details but also inferring the plots developed across multiple movie shots, thus posing unique and ongoing challenges. To advance the development of automatic movie narrating systems, we first revisit the limitations of existing datasets and develop a large-scale, bilingual movie narration dataset, Movie101v2. Second, taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages and tentatively focus on the initial stages featuring understanding within individual clips. We also introduce a new narration assessment to align with our staged task goals. Third, using our new dataset, we establish baselines with several leading large vision-language models, including GPT-4V, and conduct in-depth investigations into the challenges current models face for movie narration generation. Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.
https://arxiv.org/abs/2404.13370
As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects an MLLM's understanding of diverse image content. Although some large-scale pretrained vision encoders, such as those in CLIP and DINOv2, have brought promising performance, we found that no single vision encoder dominates across diverse image content; e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of the CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of the vision experts. This benefits from the powerful function-understanding ability of the large language model (LLM) equipped with expert-routing low-rank adaptation (LoRA). In the fine-grained stage, we design the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from the various experts. This coarse-to-fine paradigm effectively leverages representations from experts based on multimodal context and model expertise, further enhancing the generalization ability. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods across a wide range of challenging multimodal benchmarks. Codes and models will be available at this https URL.
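A highly simplified stand-in for the coarse-grained routing stage: score each vision expert by the similarity between an instruction embedding and an expert-description embedding and keep the top-k. The paper instead routes with the LLM plus expert-routing LoRA; the expert names and shapes below are made up.

```python
import numpy as np

def route_experts(instruction_emb, expert_desc_embs, expert_names, top_k=2):
    """Pick the vision experts whose expertise best matches the instruction.
    Cosine similarity here is only a simplified stand-in for LLM-based routing."""
    e = expert_desc_embs / np.linalg.norm(expert_desc_embs, axis=1, keepdims=True)
    q = instruction_emb / np.linalg.norm(instruction_emb)
    scores = e @ q
    order = np.argsort(-scores)[:top_k]
    return [expert_names[i] for i in order]

rng = np.random.default_rng(0)
names = ["clip-general", "dino-dense", "pix2struct-chart", "ocr-document"]
print(route_experts(rng.normal(size=64), rng.normal(size=(4, 64)), names))
```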
https://arxiv.org/abs/2404.13046
AI in dermatology is evolving at a rapid pace, but the major limitation to training trustworthy classifiers is the scarcity of data with ground-truth concept-level labels, which are meta-labels semantically meaningful to humans. Foundation models like CLIP, which provide zero-shot capabilities, can help alleviate this challenge by leveraging vast amounts of image-caption pairs available on the internet. CLIP can be fine-tuned using domain-specific image-caption pairs to improve classification performance. However, CLIP's pre-training data is not well-aligned with the medical jargon that clinicians use to perform diagnoses. The development of large language models (LLMs) in recent years has led to the possibility of leveraging the expressive nature of these models to generate rich text. Our goal is to use these models to generate caption text that aligns well with both the clinical lexicon and with the natural human language used in CLIP's pre-training data. Starting with captions used for images in PubMed articles, we extend them by passing the raw captions through an LLM fine-tuned on several of the field's textbooks. We find that using captions generated by an expressive fine-tuned LLM like GPT-3.5 improves downstream zero-shot concept classification performance.
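A sketch of the caption-enrichment step, assuming the OpenAI Python client and an illustrative prompt; the paper's pipeline uses an LLM fine-tuned on dermatology textbooks rather than the stock model named here, and the prompt wording is a guess.

```python
from openai import OpenAI

def enrich_caption(raw_caption: str, model: str = "gpt-3.5-turbo") -> str:
    """Rewrite a terse PubMed figure caption into richer text that mixes the
    clinical term with a lay description, so it better matches the natural
    language seen in CLIP's pre-training data."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = (
        "Rewrite this dermatology figure caption as one detailed sentence that "
        "uses both the clinical term and a plain-language description:\n"
        f"{raw_caption}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Example (requires an API key):
# print(enrich_caption("Erythematous scaly plaques on the extensor elbow."))
```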
https://arxiv.org/abs/2404.13043
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: this https URL.
https://arxiv.org/abs/2404.13013