Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at this https URL.
https://arxiv.org/abs/2501.09555
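A minimal sketch of the text-driven adaptation idea summarized above, under my own simplifying assumptions (the dimensions, the mean-offset alignment, and the dummy phase-recognition head are illustrative, not the authors' implementation): a decoder is trained on text embeddings alone and then reused on image embeddings that a handful of paired examples have shifted toward the text space.

```python
# Schematic sketch of text-driven adaptation (my own simplification, not the
# Surg-FTDA code): train a decoder on text embeddings only, then reuse it on
# image embeddings aligned with a few paired examples.
import torch
import torch.nn as nn

dim, n_classes = 512, 7          # e.g. 7 surgical phases (assumed)
decoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

# --- training: text embeddings and their labels only (no images) ---
text_emb = torch.randn(1000, dim)            # stand-in for CLIP-style text embeddings
text_lbl = torch.randint(0, n_classes, (1000,))
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(decoder(text_emb), text_lbl)
    opt.zero_grad(); loss.backward(); opt.step()

# --- few-shot modality alignment: estimate a mean offset from a handful of
# paired image/text embeddings and apply it to all image embeddings ---
few_img, few_txt = torch.randn(16, dim), torch.randn(16, dim)
offset = (few_txt - few_img).mean(dim=0)
test_img = torch.randn(32, dim)
pred = decoder(test_img + offset).argmax(dim=-1)   # image-side task, no paired labels
```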
LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior works focus on designing their own losses to effectively distill the pre-trained 2D image representations into a 3D model. However, the other parts of the design have remained surprisingly unexplored. We find that fundamental design elements, e.g., the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than the loss function itself, yet they have been overlooked in prior work. In this work, we show that simple fixes to these designs notably outperform existing methods, improving downstream task performance by 16% in 3D semantic segmentation on the nuScenes dataset and 13% in 3D object detection on the KITTI dataset. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering the side effects they produce with the commonly deployed sparse-convolution input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting itself to the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.
https://arxiv.org/abs/2501.09485
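To make the spatial quantization point concrete, here is a toy numpy illustration (my own, not the paper's code) comparing the per-point error introduced by quantizing the same synthetic LiDAR points on a Cartesian voxel grid versus a cylindrical grid; with angular bins, the error is dominated by far-range points.

```python
# Toy illustration of coordinate-dependent quantization error (not the paper's
# code): quantize the same synthetic LiDAR points with a Cartesian grid and a
# cylindrical grid and compare the mean reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-50, 50, size=(10000, 3))          # synthetic LiDAR points (meters)

def cartesian_error(pts, voxel=0.1):
    q = np.round(pts / voxel) * voxel
    return np.linalg.norm(pts - q, axis=1).mean()

def cylindrical_error(pts, d_rho=0.1, d_phi=np.deg2rad(0.5), d_z=0.1):
    rho = np.linalg.norm(pts[:, :2], axis=1)
    phi = np.arctan2(pts[:, 1], pts[:, 0])
    z = pts[:, 2]
    rho_q = np.round(rho / d_rho) * d_rho
    phi_q = np.round(phi / d_phi) * d_phi
    z_q = np.round(z / d_z) * d_z
    rec = np.stack([rho_q * np.cos(phi_q), rho_q * np.sin(phi_q), z_q], axis=1)
    return np.linalg.norm(pts - rec, axis=1).mean()

print("cartesian   mean error:", cartesian_error(pts))
print("cylindrical mean error:", cylindrical_error(pts))   # larger at long range
```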
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $\Delta$CLIP and $\Delta^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $\Delta$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $\Delta^2$LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is this https URL.
https://arxiv.org/abs/2501.09446
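For readers unfamiliar with adversarial pre-training, the sketch below shows the general shape of a PGD-based adversarial step on a CLIP-style contrastive loss, with tiny dummy encoders standing in for the real models; it illustrates the kind of training described above, not the released $\Delta$CLIP recipe (epsilon, step size, and step count are arbitrary choices here).

```python
# Minimal PGD-style adversarial training step on a CLIP-like contrastive loss,
# with dummy encoders; a sketch of the idea, not the paper's implementation.
import torch
import torch.nn.functional as F

img_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
txt_emb = F.normalize(torch.randn(8, 128), dim=-1)      # stand-in text embeddings
images = torch.rand(8, 3, 32, 32)
eps, alpha, steps = 4 / 255, 1 / 255, 3                 # assumed attack budget

def clip_loss(feat, txt):
    logits = F.normalize(feat, dim=-1) @ txt.t() / 0.07
    target = torch.arange(len(txt))
    return (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target)) / 2

adv = images.clone()
for _ in range(steps):                                   # inner maximization (PGD)
    adv.requires_grad_(True)
    loss = clip_loss(img_enc(adv), txt_emb)
    grad, = torch.autograd.grad(loss, adv)
    adv = (adv + alpha * grad.sign()).detach()
    adv = images + (adv - images).clamp(-eps, eps)
    adv = adv.clamp(0, 1)

clip_loss(img_enc(adv), txt_emb).backward()              # outer minimization step
```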
Image captioning has become an essential Vision & Language research task: predicting the most accurate caption for a given image or video. The research community has achieved impressive results by continuously proposing new models and approaches to improve overall performance. Nevertheless, despite the growing number of proposals, the performance metrics used to measure their advances have remained practically untouched over the years. As evidence of that, metrics like BLEU, METEOR, CIDEr, and ROUGE are still widely used today, alongside more sophisticated metrics such as BertScore and ClipScore. Hence, it is essential to adjust how we measure the advances, limitations, and scope of new image captioning proposals, and to adapt evaluation metrics to these more advanced approaches. This work proposes a new evaluation metric for the image captioning problem. To do so, we first generated a human-labeled dataset that assesses the degree to which captions correlate with the image's content. Taking these human scores as ground truth, we propose a new metric and compare it with several well-known metrics, from classical to more recent ones. The proposed metric outperforms them in our evaluation, and we present and discuss interesting insights.
https://arxiv.org/abs/2501.09155
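As a concrete illustration of the validation protocol described above (not the paper's metric), the snippet below scores a toy stand-in metric on a few caption/reference pairs and correlates it with hypothetical human ratings using rank correlation, which is how such metrics are usually compared against human judgements.

```python
# How a new captioning metric is typically validated: correlate its scores
# with human scores on the same captions. The metric and data are placeholders.
import numpy as np
from scipy.stats import spearmanr, kendalltau

def toy_metric(candidate: str, reference: str) -> float:
    """Hypothetical stand-in metric: unigram Jaccard overlap (not the paper's metric)."""
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    return len(c & r) / max(len(c | r), 1)

candidates = ["a dog runs on grass", "a cat sits on a mat", "people at a beach"]
references = ["a dog running through grass", "a cat on a mat", "a crowded beach"]
human_scores = np.array([4.5, 4.0, 3.0])          # e.g. 1-5 Likert ratings (made up)

metric_scores = np.array([toy_metric(c, r) for c, r in zip(candidates, references)])
rho, _ = spearmanr(metric_scores, human_scores)
tau, _ = kendalltau(metric_scores, human_scores)
print("Spearman:", rho, "Kendall tau:", tau)
```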
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it is comparable to or even exceeds state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at this https URL.
https://arxiv.org/abs/2501.08816
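The sketch below captures the flavor of a training-free CLIP adapter that mixes zero-shot text logits with few-shot image and description similarities; the embeddings are random stand-ins and the mixing weight is arbitrary, so this is a schematic in the spirit of IDEA rather than its released code.

```python
# Rough, training-free sketch in the spirit of IDEA (my simplification):
# blend zero-shot text logits with few-shot image/description similarities.
import torch
import torch.nn.functional as F

def norm(x):  # cosine-ready features
    return F.normalize(x, dim=-1)

d, n_cls, shots = 512, 10, 4
class_text = norm(torch.randn(n_cls, d))           # "a photo of a {class}" embeddings
support_img = norm(torch.randn(n_cls * shots, d))  # few-shot image embeddings
support_txt = norm(torch.randn(n_cls * shots, d))  # embeddings of their descriptions
support_lbl = torch.arange(n_cls).repeat_interleave(shots)
one_hot = F.one_hot(support_lbl, n_cls).float()

test = norm(torch.randn(32, d))
zero_shot = test @ class_text.t()                              # CLIP zero-shot logits
cache = ((test @ support_img.t()) + (test @ support_txt.t())) @ one_hot
logits = zero_shot + 0.5 * cache                               # 0.5: assumed mixing weight
print(logits.argmax(-1)[:5])
```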
This article introduces a benchmark designed to evaluate the capabilities of multimodal models in analyzing and interpreting images. The benchmark focuses on seven key visual aspects: main object, additional objects, background, detail, dominant colors, style, and viewpoint. A dataset of 14,580 images, generated from diverse text prompts, was used to assess the performance of seven leading multimodal models. These models were evaluated on their ability to accurately identify and describe each visual aspect, providing insights into their strengths and weaknesses for comprehensive image understanding. The findings of this benchmark have significant implications for the development and selection of multimodal models for various image analysis tasks.
https://arxiv.org/abs/2501.08170
Automated chest radiograph interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist's workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming state-of-the-art methods' 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at this https URL.
https://arxiv.org/abs/2501.07525
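A schematic of the concept-then-generate pipeline described above, with every model call replaced by a placeholder function (the concept list, retrieval corpus, and prompt wording are illustrative assumptions, not RadAlign's implementation): a classifier produces concept strings, similar historical reports are retrieved, and both are assembled into an LLM prompt.

```python
# Placeholder pipeline sketch: VLM-style concept classification -> retrieval of
# similar reports -> LLM prompt. All functions are stand-ins for the real models.
from typing import List

def classify_concepts(image_path: str) -> List[str]:
    # placeholder for the aligned VLM's per-disease predictions
    return ["cardiomegaly", "pleural effusion"]

def retrieve_similar_reports(concepts: List[str], k: int = 2) -> List[str]:
    # placeholder retrieval over a report database keyed by concept
    corpus = {
        "cardiomegaly": "Heart size is enlarged. No focal consolidation.",
        "pleural effusion": "Small right pleural effusion is present.",
    }
    return [corpus[c] for c in concepts if c in corpus][:k]

def build_prompt(concepts: List[str], examples: List[str]) -> str:
    return (
        "You are a radiologist. Findings detected: " + ", ".join(concepts) + ".\n"
        "Similar prior reports:\n- " + "\n- ".join(examples) + "\n"
        "Write a concise chest X-ray report consistent with these findings."
    )

concepts = classify_concepts("chest_xray.png")
prompt = build_prompt(concepts, retrieve_similar_reports(concepts))
print(prompt)   # this prompt would then be sent to an LLM of choice
```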
Accurate human posture classification in images and videos is crucial for automated applications across various fields, including work safety, physical rehabilitation, sports training, and daily assisted living. Recently, multimodal learning methods, such as Contrastive Language-Image Pretraining (CLIP), have advanced significantly in jointly understanding images and text. This study aims to assess the effectiveness of CLIP in classifying human postures, focusing on its application in yoga. Despite the initial limitations of the zero-shot approach, applying transfer learning on 15,301 images (real and synthetic) with 82 classes has shown promising results. The article describes the full fine-tuning procedure, including the choice of image description syntax and the adjustment of models and hyperparameters. The fine-tuned CLIP model, tested on 3826 images, achieves an accuracy of over 85%, surpassing the previous state of the art on the same dataset by approximately 6%, with a training time 3.5 times lower than that needed to fine-tune a YOLOv8-based model. For more application-oriented scenarios, with smaller datasets of six postures each, containing 1301 and 401 training images, the fine-tuned models attain an accuracy of 98.8% and 99.1%, respectively. Furthermore, our experiments indicate that training with as few as 20 images per pose can yield around 90% accuracy in a six-class dataset. This study demonstrates that this multimodal technique can be effectively used for yoga pose classification, and possibly for human posture classification in general. Additionally, the CLIP inference time (around 7 ms) suggests that the model can be integrated into automated systems for posture evaluation, e.g., for developing a real-time personal yoga assistant for performance assessment.
https://arxiv.org/abs/2501.07221
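A minimal fine-tuning sketch for prompt-based posture classification with CLIP, assuming the standard Hugging Face CLIP checkpoint and invented class names; the random pixel batch stands in for real training images, and none of the hyperparameters are the paper's.

```python
# Minimal CLIP fine-tuning sketch for posture classification (assumed setup,
# not the authors' training script): encode class prompts once, tune the
# vision encoder with cross-entropy against them.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["downward dog pose", "tree pose", "warrior pose"]   # example labels
text_in = processor(text=[f"a photo of a person in {c}" for c in classes],
                    return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    text_feat = F.normalize(model.get_text_features(**text_in), dim=-1)

opt = torch.optim.AdamW(model.vision_model.parameters(), lr=1e-5)
# one toy step with random pixels standing in for a real training batch
pixel_values = torch.rand(4, 3, 224, 224, device=device)
labels = torch.tensor([0, 1, 2, 0], device=device)
img_feat = F.normalize(model.get_image_features(pixel_values=pixel_values), dim=-1)
logits = img_feat @ text_feat.t() / 0.07
loss = F.cross_entropy(logits, labels)
loss.backward(); opt.step(); opt.zero_grad()
```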
The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. The framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
https://arxiv.org/abs/2501.07171
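The toy script below illustrates the train-while-streaming idea mentioned above with a torch IterableDataset and a generic contrastive loss; the data source, encoders, and dimensions are all stand-ins I made up, not the BMCA-CLIP pipeline, but the structure shows why the full corpus never needs to be materialized locally.

```python
# Toy "train while streaming" loop (my own stand-in, not the BMCA-CLIP code):
# an IterableDataset yields image-text pairs one at a time, so training can
# proceed without downloading the full corpus.
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingPairs(IterableDataset):
    """Stand-in for a remote shard iterator (e.g. WebDataset/HF streaming)."""
    def __iter__(self):
        for _ in range(256):            # pretend these arrive over the network
            yield torch.rand(3, 224, 224), torch.randint(0, 30000, (32,))  # image, token ids

loader = DataLoader(StreamingPairs(), batch_size=16)
img_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 256))
txt_enc = torch.nn.Sequential(torch.nn.Embedding(30000, 256), torch.nn.Flatten(),
                              torch.nn.Linear(32 * 256, 256))
opt = torch.optim.AdamW(list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-4)

for images, tokens in loader:
    zi = torch.nn.functional.normalize(img_enc(images), dim=-1)
    zt = torch.nn.functional.normalize(txt_enc(tokens), dim=-1)
    logits = zi @ zt.t() / 0.07
    labels = torch.arange(len(images))
    loss = (torch.nn.functional.cross_entropy(logits, labels)
            + torch.nn.functional.cross_entropy(logits.t(), labels)) / 2
    opt.zero_grad(); loss.backward(); opt.step()
```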
Considering the lack of a unified framework for image description and deep cultural analysis at the subject level in the field of Ancient Chinese Paintings (ACP), this study utilized the Beijing Palace Museum's ACP collections to develop a semantic model integrating iconological theory with a new workflow for term extraction and mapping. Our findings underscore the model's effectiveness. The resulting semantic model (SDM) can be used to support further art-related knowledge organization and cultural exploration of ACPs.
https://arxiv.org/abs/2501.08352
Previous work on augmenting large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL). This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical. However, as demand for more complex and flexible image descriptions grows, enhancing comprehension of input text within the ICL paradigm remains a critical yet underexplored area. In this work, we extend this line of research by constructing parallel multilingual prompts aimed at harnessing the multilingual capabilities of LMMs. More specifically, we translate the input text into several languages and provide the models with both the original text and the translations. Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments, especially in human preference alignment. Additionally, with its advantage of generating more diverse images, PMT2I significantly outperforms baseline prompts when incorporated with reranking methods. Our code and parallel multilingual data can be found at this https URL.
https://arxiv.org/abs/2501.07086
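A small helper in the spirit of PMT2I as described above: the input caption is paired with translations into several languages and all versions are handed to the model together. The translator here is a hard-coded placeholder and the prompt template is my own assumption, not the released one.

```python
# Sketch of parallel multilingual prompting (PMT2I-style); the translator is a
# placeholder -- any MT system or LLM could fill that role.
def translate(text: str, target_lang: str) -> str:
    # hypothetical stand-in; replace with a real MT model or API
    canned = {"de": "ein roter Apfel auf einem Holztisch",
              "fr": "une pomme rouge sur une table en bois",
              "zh": "木桌上的一个红苹果"}
    return canned.get(target_lang, text)

def build_parallel_prompt(caption: str, langs=("de", "fr", "zh")) -> str:
    lines = [f"English: {caption}"]
    lines += [f"{lang}: {translate(caption, lang)}" for lang in langs]
    lines.append("Generate an image that matches the description above.")
    return "\n".join(lines)

print(build_parallel_prompt("a red apple on a wooden table"))
```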
Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack the pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, a RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level across the entire dataset. In addition, to address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset, comprising 65,463 images and 140,412 instances, with each instance annotated with text descriptions, bounding boxes, and masks. Furthermore, we develop a two-stage training strategy to balance the distinct requirements of text generation and mask prediction in multi-modal multi-task optimization. Extensive experiments verify the effectiveness and superiority of GeoPix in pixel-level segmentation tasks, while also maintaining competitive performance in image- and region-level benchmarks.
https://arxiv.org/abs/2501.06828
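A toy mask predictor conditioned on segmentation-token embeddings, illustrating the mechanism described above; the projection layers, dimensions, and the plain dot-product decoding are my assumptions for a runnable sketch, not GeoPix's actual architecture.

```python
# Toy mask predictor conditioned on [SEG]-token embeddings (shapes and
# projections are assumptions, not the paper's architecture).
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=4096, emb_dim=256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, emb_dim, kernel_size=1)
        self.tok_proj = nn.Linear(llm_dim, emb_dim)

    def forward(self, vis_feat, seg_tok):
        # vis_feat: (B, C, H, W) from the vision encoder
        # seg_tok:  (B, N, D) hidden states of the LLM's segmentation tokens (N instances)
        v = self.vis_proj(vis_feat)                          # (B, E, H, W)
        q = self.tok_proj(seg_tok)                           # (B, N, E)
        masks = torch.einsum("bne,behw->bnhw", q, v)         # per-instance mask logits
        return masks.sigmoid()

pred = MaskPredictor()
masks = pred(torch.randn(2, 256, 64, 64), torch.randn(2, 3, 4096))
print(masks.shape)   # torch.Size([2, 3, 64, 64])
```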
This paper presents a novel method for accelerating path-planning tasks in unknown scenes with obstacles by utilizing Wasserstein Generative Adversarial Networks (WGANs) with Gradient Penalty (GP) to approximate the distribution of waypoints for a collision-free path using the Rapidly-exploring Random Tree algorithm. Our approach involves conditioning the WGAN-GP with a forward diffusion process in a continuous latent space to handle multimodal datasets effectively. We also propose encoding the waypoints of a collision-free path as a matrix, where the multidimensional ordering of the waypoints is naturally preserved. This method not only improves model learning but also enhances training convergence. Furthermore, we propose a method to assess whether the trained model fails to accurately capture the true waypoints. In such cases, we revert to uniform sampling to ensure the algorithm's probabilistic completeness; a process that traditionally involves manually determining an optimal ratio for each scenario in other machine learning-based methods. Our experiments demonstrate promising results in accelerating path-planning tasks under critical time constraints. The source code is openly available at this https URL.
https://arxiv.org/abs/2501.06639
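For reference, the gradient-penalty term that gives WGAN-GP its name looks like the snippet below: a generic formulation with a dummy critic and flattened waypoint vectors standing in for the paper's data, not the released training code.

```python
# Generic WGAN-GP gradient penalty on interpolated samples; the critic and
# the flattened waypoint vectors are dummies for illustration.
import torch

critic = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

def gradient_penalty(critic, real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1)                      # interpolation coefficients
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(x_hat).sum()
    grad, = torch.autograd.grad(score, x_hat, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()

real_paths = torch.randn(8, 32)    # e.g. 16 waypoints flattened to (x, y) pairs
fake_paths = torch.randn(8, 32)
gp = gradient_penalty(critic, real_paths, fake_paths)
(critic(fake_paths).mean() - critic(real_paths).mean() + gp).backward()   # critic loss step
```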
Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at this https URL.
https://arxiv.org/abs/2501.05901
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python codes to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus substantially improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve the performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
https://arxiv.org/abs/2501.05452
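The kinds of visual edits described above are easy to picture as small image-editing helpers a model could call; the PIL functions below (drawing a box, masking a region, highlighting an area) are my own illustrative toolset, not the ReFocus implementation.

```python
# Illustrative image-editing "tools" a model could call to shift its visual
# focus (my own helpers, not the paper's toolset).
from PIL import Image, ImageDraw

def draw_box(img: Image.Image, box, color="red", width=4) -> Image.Image:
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out

def mask_out(img: Image.Image, box, color="white") -> Image.Image:
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, fill=color)
    return out

def highlight(img: Image.Image, box, color=(255, 255, 0, 96)) -> Image.Image:
    out = img.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=color)
    return Image.alpha_composite(out, overlay)

table = Image.new("RGB", (640, 480), "white")        # stand-in for a chart/table image
step1 = draw_box(table, (40, 40, 300, 120))          # focus on a header row
step2 = mask_out(step1, (320, 40, 600, 440))         # hide an irrelevant column
```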
Multimodal large language models (MLLMs) have attracted considerable attention due to their exceptional performance in visual content understanding and reasoning. However, their inference efficiency has been a notable concern, as the increasing length of multimodal contexts leads to quadratic complexity. Token compression techniques, which reduce the number of visual tokens, have demonstrated their effectiveness in reducing computational costs. Yet, these approaches have struggled to keep pace with the rapid advancements in MLLMs, especially the AnyRes strategy in the context of high-resolution image understanding. In this paper, we propose a novel token compression method, GlobalCom$^2$, tailored for high-resolution MLLMs that receive both the thumbnail and multiple crops. GlobalCom$^2$ treats the tokens derived from the thumbnail as the ``commander'' of the entire token compression process, directing the allocation of retention ratios and the specific compression for each crop. In this way, redundant tokens are eliminated while important local details are adaptively preserved to the highest extent feasible. Empirical results across 10 benchmarks reveal that GlobalCom$^2$ achieves an optimal balance between performance and efficiency, and consistently outperforms state-of-the-art token compression methods with LLaVA-NeXT-7B/13B models. Our code is released at \url{this https URL}.
https://arxiv.org/abs/2501.05179
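A toy rendition of thumbnail-guided token compression in the spirit of the description above: the thumbnail's mean token acts as the global "commander", crop-level retention ratios are allocated from its similarity to each crop, and the most relevant tokens are kept per crop. The scoring and allocation rules here are my own stand-ins, not the released GlobalCom$^2$ method.

```python
# Toy thumbnail-guided token compression (scoring/allocation rules are my own
# stand-ins, not the released method).
import torch

def compress(thumb_tokens, crop_tokens, keep_total=0.25):
    # thumb_tokens: (T, D); crop_tokens: list of (N_i, D) tensors, one per crop
    thumb_ctx = thumb_tokens.mean(dim=0)                       # global "commander" summary
    # allocate each crop's retention ratio from its similarity to the global context
    scores = torch.stack([torch.cosine_similarity(c.mean(0), thumb_ctx, dim=0)
                          for c in crop_tokens])
    ratios = keep_total * len(crop_tokens) * torch.softmax(scores, dim=0)
    kept = []
    for crop, r in zip(crop_tokens, ratios):
        k = max(1, int(r.clamp(max=1.0) * len(crop)))
        importance = crop @ thumb_ctx                          # token relevance to global view
        idx = importance.topk(k).indices.sort().values         # keep original token order
        kept.append(crop[idx])
    return kept

crops = [torch.randn(576, 1024) for _ in range(4)]
out = compress(torch.randn(576, 1024), crops)
print([t.shape[0] for t in out])   # number of tokens kept per crop
```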
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance, (2) optimal language distributions for pre-training, and (3) optimal language distributions for instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
https://arxiv.org/abs/2501.05122
Vision Transformers (ViTs) have shown success across a variety of tasks due to their ability to capture global image representations. Recent studies have identified the existence of high-norm tokens in ViTs, which can interfere with unsupervised object discovery. To address this, "registers" have been proposed: additional tokens that isolate high-norm patch tokens while capturing global image-level information. While registers have been studied extensively for object discovery, their generalization properties, particularly in out-of-distribution (OOD) scenarios, remain underexplored. In this paper, we examine the utility of register token embeddings in providing additional features for improving generalization and anomaly rejection. To that end, we propose a simple method that combines the special CLS token embedding commonly employed in ViTs with the average-pooled register embeddings to create feature representations which are subsequently used for training a downstream classifier. We find that this enhances OOD generalization and anomaly rejection, while maintaining in-distribution (ID) performance. Extensive experiments across multiple ViT backbones trained with and without registers reveal consistent improvements of 2-4% in top-1 OOD accuracy and a 2-3% reduction in false positive rates for anomaly detection. Importantly, these gains are achieved without additional computational overhead.
https://arxiv.org/abs/2501.04784
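A minimal sketch of the feature construction described above, assuming a frozen ViT whose output token layout is 1 CLS token, a handful of register tokens, then patch tokens (the layout, dimensions, and linear head are illustrative assumptions): the CLS embedding is concatenated with the mean register embedding and fed to a downstream classifier.

```python
# Sketch: concatenate the CLS embedding with the mean register embedding and
# train a lightweight head on top. Token layout (1 CLS + 4 registers + patches)
# is an assumption about the backbone.
import torch
import torch.nn as nn

def cls_plus_registers(tokens, num_registers=4):
    # tokens: (B, 1 + num_registers + num_patches, D) output of a frozen ViT
    cls_tok = tokens[:, 0]                                 # (B, D)
    reg_mean = tokens[:, 1:1 + num_registers].mean(dim=1)  # (B, D)
    return torch.cat([cls_tok, reg_mean], dim=-1)          # (B, 2D)

dim, n_classes = 768, 100
head = nn.Linear(2 * dim, n_classes)
tokens = torch.randn(8, 1 + 4 + 196, dim)                  # stand-in ViT output
logits = head(cls_plus_registers(tokens))
print(logits.shape)   # torch.Size([8, 100])
```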
Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training time, i.e., preferences of choice given two samples, does not naturally transfer to the inference phase. We introduce a novel type of feedback -- caption reformulations -- and train models to mimic reformulation feedback based on human annotations. Our method does not require training the image captioning model itself, thereby demanding substantially less computational effort. We experiment with two types of reformulation feedback: first, we collect a dataset of human reformulations that correct errors in the generated captions. We find that incorporating reformulation models trained on this data into the inference phase of existing image captioning models results in improved captions, especially when the original captions are of low quality. We apply our method to non-English image captioning, a domain where robust models are less prevalent, and gain substantial improvement. Second, we apply reformulations to style transfer. Quantitative evaluations reveal state-of-the-art performance on German image captioning and English style transfer, while human validation with a detailed comparative framework exposes the specific axes of improvement.
https://arxiv.org/abs/2501.04513
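In pipeline terms, inference-time reformulation is just a post-processing stage on top of a frozen captioner, as sketched below with placeholder functions (the captioner and the rule-based "reformulator" are dummies standing in for the trained models described above).

```python
# Inference-time reformulation as post-processing; both functions are
# placeholders for the real models. The base captioner is never retrained.
def base_captioner(image_path: str) -> str:
    # stand-in for any off-the-shelf image captioning model
    return "a dog are playing with ball in the park"

def reformulator(caption: str) -> str:
    # stand-in for a model trained on human caption reformulations
    fixes = {"a dog are": "a dog is", "with ball": "with a ball"}
    for bad, good in fixes.items():
        caption = caption.replace(bad, good)
    return caption

draft = base_captioner("park.jpg")
final = reformulator(draft)
print(final)   # "a dog is playing with a ball in the park"
```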
Accurate segmentation of pulmonary structures is crucial in clinical diagnosis, disease study, and treatment planning. Significant progress has been made in deep learning-based segmentation techniques, but most require large amounts of labeled data for training. Consequently, developing precise segmentation methods that demand fewer labeled datasets is paramount in medical image analysis. The emergence of pre-trained vision-language foundation models, such as CLIP, has recently opened the door for universal computer vision tasks. Exploiting the generalization ability of these pre-trained foundation models on downstream tasks, such as segmentation, can yield surprisingly strong performance with a relatively small amount of labeled data. However, the use of these models for pulmonary artery-vein segmentation remains limited. This paper proposes a novel framework called the Language-guided self-adaptive Cross-Attention Fusion Framework. Our method adopts pre-trained CLIP as a strong feature extractor for generating the segmentation of 3D CT scans, while adaptively aggregating cross-modal text and image representations. We propose a specially designed adapter module to fine-tune pre-trained CLIP with a self-adaptive learning strategy to effectively fuse the two modalities of embeddings. We extensively validate our method on a local dataset, which is the largest pulmonary artery-vein CT dataset to date and consists of 718 labeled samples in total. The experiments show that our method outperforms other state-of-the-art methods by a large margin. Our data and code will be made publicly available upon acceptance.
https://arxiv.org/abs/2501.03722
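A small cross-attention fusion block of the kind described above, in which flattened 3D image features attend to text embeddings before decoding; the dimensions, projection, and residual layout are illustrative assumptions rather than the paper's exact adapter design.

```python
# Toy cross-attention fusion: flattened CT features (queries) attend to
# CLIP-style text embeddings (keys/values). Dimensions are assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, img_dim=256, txt_dim=512, heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)
        self.attn = nn.MultiheadAttention(img_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_voxels, C); txt_tokens: (B, N_text, D_text)
        txt = self.txt_proj(txt_tokens)
        fused, _ = self.attn(query=img_tokens, key=txt, value=txt)
        return self.norm(img_tokens + fused)                # residual fusion

fusion = CrossAttentionFusion()
img_tokens = torch.randn(1, 4096, 256)       # flattened CT feature volume
txt_tokens = torch.randn(1, 2, 512)          # e.g. "artery" / "vein" prompt embeddings
print(fusion(img_tokens, txt_tokens).shape)  # torch.Size([1, 4096, 256])
```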