Point cloud segmentation (PCS) plays an essential role in robot perception and navigation tasks. To efficiently understand large-scale outdoor point clouds, their range image representation is commonly adopted. This image-like representation is compact and structured, making range image-based PCS models practical. However, undesirable missing values in the range images damage the shapes and patterns of objects, making it difficult for the models to learn coherent and complete geometric information from the objects. Consequently, the PCS models achieve only inferior performance. Delving into this issue, we find that unreasonable projection approaches and deskewed scans are the main causes of the unwanted missing values in the range images. Besides, almost all previous works fail to consider filling in these missing values in the PCS task. To alleviate this problem, we first propose a new projection method, namely scan unfolding++ (SU++), to avoid massive missing values in the generated range images. Then, we introduce a simple yet effective approach, namely range-dependent K-nearest neighbor interpolation (KNNI), to further fill in missing values. Finally, we introduce the Filling Missing Values Network (FMVNet) and Fast FMVNet. Extensive experimental results on the SemanticKITTI, SemanticPOSS, and nuScenes datasets demonstrate that by employing the proposed SU++ and KNNI, existing range image-based PCS models consistently outperform their baselines. Besides, both FMVNet and Fast FMVNet achieve state-of-the-art performance in terms of the speed-accuracy trade-off. The proposed methods can be applied to other range image-based tasks and practical applications.
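The abstract does not spell out KNNI's exact formulation, but the core idea of filling range-image holes from nearby valid pixels can be sketched as follows. This is a minimal sketch, assuming a fixed search window, a fixed K rather than the paper's range-dependent one, and a 0-means-missing convention:

```python
import numpy as np

def knn_interpolate(range_img, k=3, window=2):
    """Fill missing range-image pixels (encoded as 0) with the mean of the
    k nearest valid neighbors inside a (2*window+1)^2 patch."""
    h, w = range_img.shape
    out = range_img.copy()
    ys, xs = np.where(range_img == 0)              # missing pixels
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - window), min(h, y + window + 1)
        x0, x1 = max(0, x - window), min(w, x + window + 1)
        patch = range_img[y0:y1, x0:x1]
        yy, xx = np.nonzero(patch)                 # valid neighbors only
        if yy.size == 0:
            continue                               # no valid neighbor; leave the hole
        d = np.hypot(yy + y0 - y, xx + x0 - x)     # pixel distances to the hole
        nearest = np.argsort(d)[:k]
        out[y, x] = patch[yy[nearest], xx[nearest]].mean()
    return out

# toy 4x4 range image with two holes
img = np.array([[5., 5., 0., 6.],
                [5., 0., 6., 6.],
                [4., 5., 5., 6.],
                [4., 4., 5., 5.]])
print(knn_interpolate(img))
```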
https://arxiv.org/abs/2405.10175
In this work, we introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM). The decoupled vision system decouples inner-modal modeling and cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM to route the vision and language flows during attention computing to enable different attention patterns in inner-modal modeling and cross-modal interaction scenarios. Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training samples, providing a new perspective for future multimodal foundation models. Code is available at this https URL.
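As a rough illustration of what "routing the vision and language flows during attention computing" could mean, here is a hedged sketch of modality-routed attention in PyTorch. The separate QKV projections standing in for the visual expert, and the routing-by-mask mechanism, are assumptions; Libra's actual expert and cross-modal bridge module are certainly more involved:

```python
import torch
import torch.nn as nn

class RoutedAttention(nn.Module):
    """Sketch: vision and text tokens get separate ("expert") QKV projections,
    then attend jointly in one sequence."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.text_qkv = nn.Linear(dim, 3 * dim)    # stands in for pretrained LLM weights
        self.vision_qkv = nn.Linear(dim, 3 * dim)  # stands in for the visual expert
        self.out = nn.Linear(dim, dim)

    def forward(self, x, is_vision):
        # is_vision: (B, T) bool mask marking vision tokens; for simplicity both
        # projections are computed for all tokens and the mask picks one.
        qkv = torch.where(is_vision.unsqueeze(-1),
                          self.vision_qkv(x), self.text_qkv(x))
        B, T, _ = x.shape
        q, k, v = qkv.chunk(3, dim=-1)
        split = lambda t: t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        y = nn.functional.scaled_dot_product_attention(split(q), split(k), split(v))
        return self.out(y.transpose(1, 2).reshape(B, T, -1))

x = torch.randn(1, 8, 64)
mask = torch.tensor([[1, 1, 1, 0, 0, 0, 0, 0]], dtype=torch.bool)  # first 3 = vision
print(RoutedAttention(64, 4)(x, mask).shape)  # torch.Size([1, 8, 64])
```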
https://arxiv.org/abs/2405.10140
The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Due to the high dimensionality of the image space, learning a good visual representation requires a considerable amount of visual data. However, when learning in the real world, data is expensive. Sim2Real is a promising paradigm for overcoming data scarcity in the real-world target domain by using a simulator to collect large amounts of cheap data closely related to the target task. However, it is difficult to transfer an image-conditioned policy from sim to real when the domains are very visually dissimilar. To bridge the sim2real visual gap, we propose using natural language descriptions of images as a unifying signal across domains that captures the underlying task-relevant semantics. Our key insight is that if two image observations from different domains are labeled with similar language, the policy should predict similar action distributions for both images. We demonstrate that training the image encoder to predict the language description or the distance between descriptions of a sim or real image serves as a useful, data-efficient pretraining step that helps learn a domain-invariant image representation. We can then use this image encoder as the backbone of an IL policy trained simultaneously on a large amount of simulated and a handful of real demonstrations. Our approach outperforms widely used prior sim2real methods and strong vision-language pretraining baselines like CLIP and R3M by 25 to 40%.
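The abstract says the encoder is trained to predict the description or the distance between descriptions. One plausible instantiation of the distance variant, assuming cosine similarity matrices matched with an MSE loss (both my choices, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def language_distance_loss(img_emb, lang_emb):
    """Images whose descriptions are close in language-embedding space should
    be close in image-embedding space.
    img_emb:  (B, D) from the trainable image encoder
    lang_emb: (B, D) from a frozen text encoder over the images' descriptions"""
    img_sim = F.cosine_similarity(img_emb.unsqueeze(1), img_emb.unsqueeze(0), dim=-1)
    lang_sim = F.cosine_similarity(lang_emb.unsqueeze(1), lang_emb.unsqueeze(0), dim=-1)
    return F.mse_loss(img_sim, lang_sim.detach())   # only the image encoder learns

img_emb = torch.randn(4, 128, requires_grad=True)   # stand-in encoder outputs
lang_emb = torch.randn(4, 128)                      # stand-in description embeddings
print(language_distance_loss(img_emb, lang_emb))
```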
https://arxiv.org/abs/2405.10020
Automated medical image analysis systems often require large amounts of training data with high-quality labels, which are difficult and time-consuming to generate. This paper introduces Radiology Objects in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, adding 35,705 new images that have appeared in PMC since 2018. It further provides manually curated concepts for imaging modalities, with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using the Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models and evaluation of deep learning models for multi-task learning.
https://arxiv.org/abs/2405.10004
In the new paradigm of semantic communication (SC), the focus is on delivering the meaning behind bits by extracting semantic information from raw data. Recent advances in data-to-text models facilitate language-oriented SC, particularly for text-transformed image communication via image-to-text (I2T) encoding and text-to-image (T2I) decoding. However, although semantically aligned, the text is too coarse to precisely capture sophisticated visual features such as spatial locations, color, and texture, incurring a significant perceptual difference between the intended and reconstructed images. To address this limitation, in this paper, we propose a novel language-oriented SC framework that communicates both text and a compressed image embedding and combines them using a latent diffusion model to reconstruct the intended image. Experimental results validate the potential of our approach, which transmits only 2.09% of the original image size while achieving higher perceptual similarity over noisy communication channels than a baseline SC method that communicates only through text. The code is available at this https URL.
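A structural sketch of the proposed transmit/reconstruct flow, with stand-in callables for the I2T encoder, the image-embedding compressor, and the latent diffusion decoder; the AWGN channel model is also an assumption:

```python
import numpy as np

def awgn(x, snr_db):
    """Add white Gaussian noise to a real-valued signal at a given SNR."""
    power = np.mean(x ** 2)
    noise = np.random.randn(*x.shape) * np.sqrt(power / 10 ** (snr_db / 10))
    return x + noise

def transmit(image, i2t_encoder, img_encoder, t2i_diffusion, snr_db=10):
    """Send a caption plus a compressed image embedding; the receiver fuses
    both in a latent diffusion model to reconstruct the intended image."""
    caption = i2t_encoder(image)        # text branch: semantically aligned but coarse
    z = img_encoder(image)              # compressed embedding (~2% of the image size)
    z_rx = awgn(z, snr_db)              # noisy channel on the embedding
    return t2i_diffusion(prompt=caption, latent=z_rx)

# toy wiring with placeholders, just to show the data flow
recon = transmit(
    image=np.zeros((64, 64, 3)),
    i2t_encoder=lambda img: "a toy caption",
    img_encoder=lambda img: np.random.randn(256),
    t2i_diffusion=lambda prompt, latent: latent,
)
```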
https://arxiv.org/abs/2405.09976
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities: it achieves state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in the unified modeling of full multimodal documents.
https://arxiv.org/abs/2405.09818
Esophageal cancer is one of the most common types of cancer worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively customize personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their comprehensive ability to examine patients' conditions. However, multi-modal methods can introduce information redundancy, leading to underperformance. In addition, efficient and effective interactions between multi-modal representations remain underexplored, and prior work lacks an insightful exploration of the prognostic correlations among multi-modal features. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images as well as clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. Following this, a conditional feature-guided diffusion approach is applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy, aiming to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Various experimental results validate the effectiveness of our proposed method. The code is available at this https URL.
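As a plain-Python sketch, the multi-modal heterogeneous graph might be laid out as below; node types, feature dimensions, and edge choices are all illustrative assumptions, and a relational GNN would then message-pass separately per edge type:

```python
import numpy as np

# One node set per feature source; feature dimensions are made up.
nodes = {
    "tumor_image": np.random.randn(1, 256),   # primary-tumor CT features
    "node_image":  np.random.randn(5, 256),   # lymph-node CT features (5 nodes)
    "clinical":    np.random.randn(1, 32),    # clinical measurements
    "radiomics":   np.random.randn(1, 128),   # radiomics features
}

# Typed edges (src_index, dst_index) connecting the modalities.
edges = {
    ("tumor_image", "node_image"): [(0, j) for j in range(5)],
    ("tumor_image", "clinical"):   [(0, 0)],
    ("tumor_image", "radiomics"):  [(0, 0)],
    ("node_image",  "clinical"):   [(j, 0) for j in range(5)],
}
```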
https://arxiv.org/abs/2405.09539
Medical image interpretation using deep learning has shown promise but often requires extensive expert-annotated datasets. To reduce this annotation burden, we develop an Image-Graph Contrastive Learning framework that pairs chest X-rays with structured report knowledge graphs automatically extracted from radiology notes. Our approach uniquely encodes the disconnected graph components via a relational graph convolution network and transformer attention. In experiments on the CheXpert dataset, this novel graph encoding strategy enabled the framework to outperform existing methods that use image-text contrastive learning in 1% linear evaluation and few-shot settings, while achieving comparable performance to radiologists. By exploiting unlabeled paired images and text, our framework demonstrates the potential of structured clinical insights to enhance contrastive learning for medical images. This work points toward reducing demands on medical experts for annotations, improving diagnostic precision, and advancing patient care through robust medical image understanding.
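The abstract does not state the contrastive objective explicitly; a standard symmetric InfoNCE between X-ray embeddings and report-graph embeddings, as commonly used in image-text contrastive learning, would look like this (the temperature is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, graph_emb, tau=0.07):
    """Symmetric InfoNCE: the i-th image and i-th report graph form a positive
    pair; every other pairing in the batch is a negative."""
    img = F.normalize(img_emb, dim=-1)
    gra = F.normalize(graph_emb, dim=-1)
    logits = img @ gra.t() / tau
    labels = torch.arange(len(img))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

print(info_nce(torch.randn(8, 512), torch.randn(8, 512)))
```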
https://arxiv.org/abs/2405.09594
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Can web-based image processing and visualization tools easily integrate into existing websites without significant time and effort? Our Boostlet.js library addresses this challenge by providing an open-source, JavaScript-based web framework that enables additional image-processing functionalities. Boostlet examples include kernel filtering, image captioning, data visualization, segmentation, and web-optimized machine-learning models. To achieve this, Boostlet.js uses a browser bookmark to inject a user-friendly plugin-selection tool called PowerBoost into any host website. Boostlet also provides on-site access to a standard API for pixel data and scene manipulation that is independent of any visualization framework. Web-based Boostlets provide a modular architecture and client-side processing capabilities to apply advanced image-processing techniques using consumer-level hardware. The code is open-source and available.
https://arxiv.org/abs/2405.07868
We present SLIP (SAM+CLIP), an enhanced architecture for zero-shot object segmentation. SLIP combines the Segment Anything Model (SAM) (Kirillov et al., 2023) with Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021). By incorporating text prompts into SAM using CLIP, SLIP enables object segmentation without prior training on specific classes or categories. We fine-tune CLIP on a Pokemon dataset, allowing it to learn meaningful image-text representations. SLIP demonstrates the ability to recognize and segment objects in images based on contextual information from text prompts. Our experiments demonstrate the effectiveness of the SLIP architecture in segmenting objects in images based on textual cues. The integration of CLIP's text-image understanding capabilities into SAM expands the capabilities of the original architecture and enables more versatile and context-aware object segmentation.
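The abstract leaves the exact SAM-CLIP coupling unspecified; one common wiring is to score SAM's class-agnostic masks by CLIP similarity to the text prompt and keep the best, sketched here with a stand-in for the CLIP image encoder:

```python
import numpy as np

def select_mask(image, text_emb, masks, clip_image_encoder):
    """Score each (non-empty) SAM mask by cosine similarity between the CLIP
    embedding of its bounding-box crop and the text-prompt embedding."""
    def crop(img, m):
        ys, xs = np.where(m)
        return img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    scores = [float(clip_image_encoder(crop(image, m)) @ text_emb) for m in masks]
    return masks[int(np.argmax(scores))]

img = np.random.rand(32, 32, 3)
masks = [np.zeros((32, 32), bool) for _ in range(3)]
for i, m in enumerate(masks):
    m[i * 8 : i * 8 + 8, :] = True                 # three horizontal strips
enc = lambda c: np.ones(4) / 2.0                   # stand-in CLIP image encoder
best = select_mask(img, np.ones(4) / 2.0, masks, enc)
```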
https://arxiv.org/abs/2405.07284
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose to take advantage of existing pre-trained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to make the text generation model sufficiently aware of the content in a given video so that it generates corresponding captions. To address this problem, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Differing from the conventional way of training these tokens with training data, we update these tokens with pseudo-targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information catered for GPT-2. This procedure can be done in just a few iterations (we use 16 iterations in the experiments) and does not require ground-truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
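The test-time adaptation loop reduces to optimizing only the learnable tokens while GPT-2, CLIP, and XCLIP stay frozen. A minimal sketch, with the frozen models and the pseudo-target losses folded into one stand-in callable:

```python
import torch

def adapt_tokens(video_feats, loss_step, n_tokens=8, dim=768, iters=16, lr=1e-2):
    """Only `tokens` is trainable. `loss_step(tokens, video_feats)` is a
    stand-in for the paper's carefully crafted losses over pseudo-targets
    of the inference sample; everything inside it is frozen."""
    tokens = torch.zeros(n_tokens, dim, requires_grad=True)
    opt = torch.optim.AdamW([tokens], lr=lr)
    for _ in range(iters):                 # the paper reports 16 iterations
        loss = loss_step(tokens, video_feats)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tokens.detach()

toy_step = lambda tok, vid: ((tok - vid.mean()) ** 2).mean()   # placeholder loss
print(adapt_tokens(torch.randn(8, 768), toy_step).shape)       # torch.Size([8, 768])
```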
https://arxiv.org/abs/2405.07046
In Magnetic Resonance Imaging (MRI), image acquisitions are often undersampled in the measurement domain to accelerate the scanning process, at the expense of image quality. However, image quality is a crucial factor that influences the accuracy of clinical diagnosis; hence, high-quality image reconstruction from undersampled measurements has been a key area of research. Recently, deep learning (DL) methods have emerged as the state-of-the-art for MRI reconstruction, typically involving deep neural networks to transform undersampled MRI images into high-quality MRI images through data-driven processes. Nevertheless, there is clear and significant room for improvement in undersampled DL MRI reconstruction to meet the high standards required for clinical diagnosis, in terms of eliminating aliasing artifacts and reducing image noise. In this paper, we introduce a self-supervised pretraining procedure using contrastive learning to improve the accuracy of undersampled DL MRI reconstruction. We use contrastive learning to transform the MRI image representations into a latent space that maximizes mutual information among different undersampled representations and optimizes the information content at the input of the downstream DL reconstruction models. Our experiments demonstrate improved reconstruction accuracy across a range of acceleration factors and datasets, both quantitatively and qualitatively. Furthermore, our extended experiments validate the proposed framework's robustness under adversarial conditions, such as measurement noise, different k-space sampling patterns, and pathological abnormalities, and also prove the transfer learning capabilities on MRI datasets with completely different anatomy. Additionally, we conducted experiments to visualize and analyze the properties of the proposed MRI contrastive learning latent space.
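As a toy example, two "differently undersampled representations" of the same scan, ready for a contrastive view pair, can be produced by applying different random k-space masks; random column masking with a preserved low-frequency band is an assumed pattern, not necessarily the paper's:

```python
import numpy as np

def undersample(image, accel=4):
    """One undersampled view: mask k-space columns at random, keeping a small
    fully sampled low-frequency band in the center."""
    k = np.fft.fftshift(np.fft.fft2(image))
    w = image.shape[1]
    mask = np.random.rand(w) < 1.0 / accel
    mask[w // 2 - 8 : w // 2 + 8] = True           # preserve low frequencies
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k * mask[None, :])))

slice_ = np.random.rand(128, 128)                        # stand-in fully sampled slice
view1, view2 = undersample(slice_), undersample(slice_)  # a positive pair
# an encoder trained with a standard contrastive loss over such pairs then
# maximizes mutual information among the undersampled representations
```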
https://arxiv.org/abs/2306.00530
An all-too-present bottleneck in text classification model development is the need to annotate training data, and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.
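The translate-then-tune data step can be reproduced with any off-the-shelf MT model, for example MarianMT from Hugging Face transformers; the model choice below is illustrative, and the paper's MT system and its captioning-derived loss are not shown:

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"                # illustrative en->de model
tok = MarianTokenizer.from_pretrained(name)
mt = MarianMTModel.from_pretrained(name)

def translate(texts):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return tok.batch_decode(mt.generate(**batch), skip_special_tokens=True)

# labels carry over unchanged to the translated copies
train_en = [("the product arrived broken", "negative")]
train_de = [(translate([text])[0], label) for text, label in train_en]
```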
https://arxiv.org/abs/2405.05478
In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38-13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with 5% and 16.67% improvements in navigation success rate, respectively.
https://arxiv.org/abs/2405.05363
The diagnosis and treatment of chest diseases play a crucial role in maintaining human health. X-ray examination has become the most common clinical examination means due to its efficiency and cost-effectiveness. Artificial intelligence analysis methods for chest X-ray images are limited by insufficient annotation data and varying levels of annotation, resulting in weak generalization ability and difficulty in clinical dissemination. Here we present EVA-X, an innovative foundational model based on X-ray images with broad applicability to various chest disease detection tasks. EVA-X is the first X-ray-image-based self-supervised learning method capable of capturing both semantic and geometric information from unlabeled images for universal X-ray image representation. Through extensive experimentation, EVA-X has demonstrated exceptional performance in chest disease analysis and localization, becoming the first model capable of spanning over 20 different chest diseases and achieving leading results in over 11 different detection tasks in the medical field. Additionally, EVA-X significantly reduces the burden of data annotation in the medical AI field, showcasing strong potential in the domain of few-shot learning. The emergence of EVA-X will greatly propel the development and application of foundational medical models, bringing about revolutionary changes in future medical research and clinical practice. Our codes and models are available at: this https URL.
https://arxiv.org/abs/2405.05237
Refining 3D LiDAR data has attracted growing interest, motivated by recent techniques such as supervised learning and generative model-based methods. Existing approaches have shown the possibility of using diffusion models to generate refined LiDAR data with high fidelity, although the performance and speed of such methods have been limited. These limitations make them difficult to execute in real time, causing the approaches to struggle in real-world tasks such as autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast and high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have been shown to give high performance on image completion tasks. We introduce a series of experiments, including multiple datasets, sampling steps, and conditional masks, to determine the ideal configuration, striking a balance between performance and inference speed. This paper illustrates that our method outperforms the baselines in sampling speed and quality on upsampling tasks using the KITTI-360 dataset. Furthermore, we illustrate the generalization ability of our approach by simultaneously training on real-world and synthetic datasets, introducing variance in quality and environments.
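Conditional inpainting masks in a DDPM are commonly applied RePaint-style: at each reverse step, the known (sparse) pixels are re-noised to the current level and composited with the model's denoised unknowns. A hedged sketch of one such step; the paper's exact scheme may differ:

```python
import torch

def inpaint_step(x_t, known, mask, model, t, alphas_cumprod):
    """One reverse step (t >= 1). `known` is the sparse input in image form,
    `mask` marks its valid pixels, and `model(x_t, t)` is a stand-in that
    returns the reverse-process sample x_{t-1} for all pixels."""
    a = alphas_cumprod[t - 1]                      # noise level of step t-1
    known_tm1 = a.sqrt() * known + (1 - a).sqrt() * torch.randn_like(known)
    x_tm1 = model(x_t, t)
    return mask * known_tm1 + (1 - mask) * x_tm1   # composite known / unknown
```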
https://arxiv.org/abs/2405.04889
In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
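One simple way to quantify "ranking similarity for paraphrased queries" is the overlap of the top-k retrievals returned for the two paraphrases; the paper's exact metric may differ:

```python
import numpy as np

def topk_overlap(q1_emb, q2_emb, image_embs, k=10):
    """Jaccard overlap of the two paraphrases' top-k shortlists
    (1.0 = identical retrievals, 0.0 = disjoint)."""
    r1 = set(np.argsort(-image_embs @ q1_emb)[:k])
    r2 = set(np.argsort(-image_embs @ q2_emb)[:k])
    return len(r1 & r2) / len(r1 | r2)

imgs = np.random.randn(1000, 64)                   # toy image-embedding corpus
q1, q2 = np.random.randn(64), np.random.randn(64)  # stand-in paraphrase embeddings
print(topk_overlap(q1, q2, imgs))
```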
https://arxiv.org/abs/2405.03190
Effective image classification hinges on discerning relevant features from both foreground and background elements, with the foreground typically holding the critical information. While humans adeptly classify images with limited exposure, artificial neural networks often struggle with feature selection from rare samples. To address this challenge, we propose a novel method for selecting class-relevant patch embeddings. Our approach involves splitting support and query images into patches and encoding them using a pre-trained Vision Transformer (ViT) to obtain class embeddings and patch embeddings, respectively. We then filter the patch embeddings using the class embedding to retain only the class-relevant ones: for each image, we calculate the similarity between the class embedding and each patch embedding, sort the similarity sequence in descending order, and retain only the top-ranked patch embeddings. The selected patch embeddings are fused with the class embedding to form a comprehensive image representation, enhancing pattern recognition across instances. Our strategy effectively mitigates the impact of class-irrelevant patch embeddings, yielding improved performance with pre-trained models. Extensive experiments on popular few-shot classification benchmarks demonstrate the simplicity, efficacy, and computational efficiency of our approach, outperforming state-of-the-art baselines under both 5-shot and 1-shot scenarios.
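The selection recipe in the abstract translates almost directly into code; only the fusion operator (mean pooling here) and the value of top-k are assumptions:

```python
import torch
import torch.nn.functional as F

def select_and_fuse(cls_emb, patch_embs, top_k=16):
    """Rank patch embeddings by cosine similarity to the class embedding,
    keep the top-k, and fuse them with the class embedding.
    cls_emb: (D,)   patch_embs: (N, D)"""
    sims = F.cosine_similarity(patch_embs, cls_emb.unsqueeze(0), dim=-1)
    top = patch_embs[sims.argsort(descending=True)[:top_k]]
    return torch.cat([cls_emb.unsqueeze(0), top], dim=0).mean(dim=0)

cls_emb, patches = torch.randn(384), torch.randn(196, 384)  # ViT-S/16-like shapes
print(select_and_fuse(cls_emb, patches).shape)              # torch.Size([384])
```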
https://arxiv.org/abs/2405.03722
Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.
https://arxiv.org/abs/2405.02793