Referring image segmentation (RIS) aims to locate the particular region corresponding to a language expression. Existing methods incorporate features from the two modalities in a bottom-up manner. This design can retain unnecessary image-text pairings, which leads to inaccurate segmentation masks. In this paper, we propose a referring image segmentation method called HARIS, which introduces a Human-Like Attention mechanism and uses a parameter-efficient fine-tuning (PEFT) framework. Specifically, the Human-Like Attention receives a feedback signal from the multi-modal features, which makes the network focus on the specified objects and discard irrelevant image-text pairs. Besides, we introduce the PEFT framework to preserve the zero-shot ability of the pre-trained encoders. Extensive experiments on three widely used RIS benchmarks and the PhraseCut dataset demonstrate that our method achieves state-of-the-art performance and strong zero-shot ability.
https://arxiv.org/abs/2405.10707
In the rapidly evolving field of business process management, there is a growing need for analytical tools that can transform complex data into actionable insights. This research introduces a novel approach by integrating Large Language Models (LLMs), such as ChatGPT, into process mining tools, making process analytics more accessible to a wider audience. The study aims to investigate how ChatGPT enhances analytical capabilities, improves user experience, increases accessibility, and optimizes the architectural frameworks of process mining tools. The key innovation of this research lies in developing a tailored prompt engineering strategy for each process mining submodule, ensuring that the AI-generated outputs are accurate and relevant to the context. The integration architecture follows an Extract, Transform, Load (ETL) process, which includes various process mining engine modules and utilizes zero-shot and optimized prompt engineering techniques. ChatGPT is connected via APIs and receives structured outputs from the process mining modules, enabling conversational interactions. To validate the effectiveness of this approach, the researchers used data from 17 companies that employ BehfaLab's Process Mining Tool. The results showed significant improvements in user experience, with an expert panel rating 72% of the results as "Good". This research contributes to the advancement of business process analysis methodologies by combining process mining with artificial intelligence. Future research directions include further optimization of prompt engineering, exploration of integration with other AI technologies, and assessment of scalability across various business environments. This study paves the way for continuous innovation at the intersection of process mining and artificial intelligence, promising to revolutionize the way businesses analyze and optimize their processes.
https://arxiv.org/abs/2405.10689
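As a rough illustration of the integration pattern described above, the sketch below wraps one submodule's structured output in a tailored zero-shot prompt and assembles a chat-completion request body. The module names, KPI fields, prompt wording, and model name are illustrative assumptions, not details from the paper or BehfaLab's tool.

```python
import json

# Hypothetical sketch: each process mining submodule emits a structured result,
# which is wrapped in a module-specific (tailored) zero-shot prompt before
# being sent to a chat-completion API. All names and wording are invented.

def build_module_prompt(module: str, result: dict) -> str:
    """Wrap one submodule's structured output in a tailored zero-shot prompt."""
    templates = {
        "bottleneck_analysis": (
            "You are a process mining analyst. Given these average activity "
            "waiting times (hours), identify the main bottleneck and suggest one fix."
        ),
        "variant_analysis": (
            "You are a process mining analyst. Given these case-variant "
            "frequencies, summarize the dominant process flows."
        ),
    }
    instruction = templates.get(module, "Summarize this process mining output.")
    return f"{instruction}\n\nStructured output:\n{json.dumps(result, indent=2)}"

def build_chat_payload(prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    """Assemble the JSON body that would be POSTed to a chat-completion API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic output suits analytical summaries
    }

# Example: structured output of a (hypothetical) bottleneck-analysis module.
kpis = {"Approve Invoice": 41.5, "Register Request": 2.1, "Pay Invoice": 7.3}
payload = build_chat_payload(build_module_prompt("bottleneck_analysis", kpis))
```

In the described architecture, a payload like this would be sent over the API after the ETL stage produces the module's structured output.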
Large Language Models (LLMs) have transformed NLP with their remarkable In-context Learning (ICL) capabilities. Automated assistants based on LLMs are gaining popularity; however, adapting them to novel tasks is still challenging. While colossal models excel in zero-shot performance, their computational demands limit widespread use, and smaller language models struggle without context. This paper investigates whether LLMs can generalize from labeled examples of predefined tasks to novel tasks. Drawing inspiration from biological neurons and the mechanistic interpretation of the Transformer architecture, we explore the potential for information sharing across tasks. We design a cross-task prompting setup with three LLMs and show that LLMs achieve significant performance improvements despite no examples from the target task in the context. Cross-task prompting leads to a remarkable performance boost of 107% for LLaMA-2 7B, 18.6% for LLaMA-2 13B, and 3.2% for GPT 3.5 on average over zero-shot prompting, and performs comparably to standard in-context learning. The effectiveness of generating pseudo-labels for in-task examples is demonstrated, and our analyses reveal a strong correlation between the effect of cross-task examples and model activation similarities in source and target input tokens. This paper offers a first-of-its-kind exploration of LLMs' ability to solve novel tasks based on contextual signals from different task examples.
https://arxiv.org/abs/2405.10548
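The cross-task prompting setup described above can be sketched minimally as follows: the in-context demonstrations are labeled examples from a source task (sentiment, here), while the unlabeled query comes from a different target task (topic classification). All texts, labels, and the prompt format are invented for illustration and are not the paper's actual prompts.

```python
# Cross-task prompt: demonstrations from a source task, query from a target task.
def format_example(text, label=None):
    line = f"Input: {text}\nLabel:"
    return f"{line} {label}" if label is not None else line

def build_cross_task_prompt(source_examples, target_query):
    """Concatenate labeled source-task demonstrations with an unlabeled target query."""
    parts = [format_example(t, l) for t, l in source_examples]
    parts.append(format_example(target_query))
    return "\n\n".join(parts)

# Source task: sentiment classification (labeled demonstrations).
source_examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("A dull, lifeless script.", "negative"),
]
# Target task: topic classification (no target-task examples in context).
prompt = build_cross_task_prompt(
    source_examples,
    "Stocks fell sharply after the central bank raised rates.",
)
```

The model then completes the final `Label:` line for the target-task input it has never seen demonstrated.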
The advent of natural language processing and large language models (LLMs) has revolutionized the extraction of data from unstructured scholarly papers. However, ensuring data trustworthiness remains a significant challenge. In this paper, we introduce PropertyExtractor, an open-source tool that leverages advanced conversational LLMs like Google Gemini-Pro and OpenAI GPT-4, blends zero-shot with few-shot in-context learning, and employs engineered prompts for the dynamic refinement of structured information hierarchies, enabling autonomous, efficient, scalable, and accurate identification, extraction, and verification of material property data. Our tests on material data demonstrate precision and recall exceeding 93% with an error rate of approximately 10%, highlighting the effectiveness and versatility of the toolkit. We apply PropertyExtractor to generate a database of 2D material thicknesses, a critical parameter for device integration. The rapid evolution of the field has outpaced both experimental measurements and computational methods, creating a significant data gap. Our work addresses this gap and showcases the potential of PropertyExtractor as a reliable and efficient tool for the autonomous generation of diverse material property databases, advancing the field.
https://arxiv.org/abs/2405.10448
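A hedged sketch of the blended zero-/few-shot prompting idea described above: a zero-shot instruction fixes a JSON output schema, and a few in-context examples anchor the format before the new passage is appended. The schema, example record, and wording are illustrative assumptions, not PropertyExtractor's actual prompts.

```python
import json

# Zero-shot instruction fixing the output schema (field names are invented).
SCHEMA = {"material": "string", "property": "string", "value": "number", "unit": "string"}

# Few-shot anchor for the output format (an invented example record).
FEW_SHOT = [
    (
        "Monolayer MoS2 has a measured thickness of 0.65 nm.",
        {"material": "MoS2", "property": "thickness", "value": 0.65, "unit": "nm"},
    )
]

def build_extraction_prompt(passage: str) -> str:
    """Blend a zero-shot schema instruction with few-shot examples, then append the passage."""
    lines = [
        "Extract material properties from the text as JSON matching this schema:",
        json.dumps(SCHEMA),
    ]
    for text, record in FEW_SHOT:
        lines += [f"Text: {text}", f"JSON: {json.dumps(record)}"]
    lines += [f"Text: {passage}", "JSON:"]
    return "\n".join(lines)

prompt = build_extraction_prompt("Exfoliated graphene flakes were 0.34 nm thick.")
```

The structured JSON returned for each passage can then be validated and accumulated into a property database, as the abstract describes for 2D material thicknesses.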
The infrequency and heterogeneity of clinical presentations in rare diseases often lead to underdiagnosis and their exclusion from structured datasets. This necessitates the utilization of unstructured text data for comprehensive analysis. However, the manual identification from clinical reports is an arduous and intrinsically subjective task. This study proposes a novel hybrid approach that synergistically combines a traditional dictionary-based natural language processing (NLP) tool with the powerful capabilities of large language models (LLMs) to enhance the identification of rare diseases from unstructured clinical notes. We comprehensively evaluate various prompting strategies on six large language models (LLMs) of varying sizes and domains (general and medical). This evaluation encompasses zero-shot, few-shot, and retrieval-augmented generation (RAG) techniques to enhance the LLMs' ability to reason about and understand contextual information in patient reports. The results demonstrate effectiveness in rare disease identification, highlighting the potential for identifying underdiagnosed patients from clinical notes.
https://arxiv.org/abs/2405.10440
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at this https URL
https://arxiv.org/abs/2405.10300
Foundation models in computational pathology promise to unlock the development of new clinical decision support systems and models for precision medicine. However, there is a mismatch between most clinical analysis, which is defined at the level of one or more whole slide images, and foundation models to date, which process the thousands of image tiles contained in a whole slide image separately. The requirement to train a network to aggregate information across a large number of tiles in multiple whole slide images limits these models' impact. In this work, we present a slide-level foundation model for H&E-stained histopathology, PRISM, that builds on Virchow tile embeddings and leverages clinical report text for pre-training. Using the tile embeddings, PRISM produces slide-level embeddings with the ability to generate clinical reports, resulting in several modes of use. Using text prompts, PRISM achieves zero-shot cancer detection and sub-typing performance approaching and surpassing that of a supervised aggregator model. Using the slide embeddings with linear classifiers, PRISM surpasses supervised aggregator models. Furthermore, we demonstrate that fine-tuning of the PRISM slide encoder yields label-efficient training for biomarker prediction, a task that typically suffers from low availability of training data; an aggregator initialized with PRISM and trained on as little as 10% of the training data can outperform a supervised baseline that uses all of the data.
https://arxiv.org/abs/2405.10254
The rapid evolution of large language models (LLMs) has ushered in the need for comprehensive assessments of their performance across various dimensions. In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which aims to evaluate the capability of LLMs in long-fiction comprehension and reasoning. We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. Additionally, we conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations. Through a series of experiments with various state-of-the-art LLMs, we demonstrate that these models face considerable challenges in effectively addressing questions related to literary fictions, with ChatGPT reaching only 57.08% under the zero-shot setting. The dataset will be publicly available at this https URL
https://arxiv.org/abs/2405.10166
We introduce LatentTimePFN (LaT-PFN), a foundational Time Series model with a strong embedding space that enables zero-shot forecasting. To achieve this, we perform in-context learning in latent space utilizing a novel integration of the Prior-data Fitted Networks (PFN) and Joint Embedding Predictive Architecture (JEPA) frameworks. We leverage the JEPA framework to create a prediction-optimized latent representation of the underlying stochastic process that generates time series and combines it with contextual learning, using a PFN. Furthermore, we improve on preceding works by utilizing related time series as a context and introducing an abstract time axis. This drastically reduces training time and increases the versatility of the model by allowing any time granularity and forecast horizon. We show that this results in superior zero-shot predictions compared to established baselines. We also demonstrate our latent space produces informative embeddings of both individual time steps and fixed-length summaries of entire series. Finally, we observe the emergence of multi-step patch embeddings without explicit training, suggesting the model actively learns discrete tokens that encode local structures in the data, analogous to vision transformers.
https://arxiv.org/abs/2405.10093
The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. In adapting LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned data pairs in training data. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning rich and expressive joint embedding space, which achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap across audio and text embedding, which surpasses both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in training data on the AudioCaps dataset. Our code is available at this https URL
https://arxiv.org/abs/2405.10084
Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
https://arxiv.org/abs/2405.10075
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during inference. However, our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities, posing a concern for real-world deployment. To this end, we introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies, while retaining improvements using hierarchies generated by large language models. Moreover, when applied to open-vocabulary classification on ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector, without incurring additional computational overhead during inference. The code is open source.
https://arxiv.org/abs/2405.10053
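The three offline steps described above can be sketched with a toy hierarchy and a stand-in text encoder (a real system would use the detector's text encoder, e.g. CLIP's). The hierarchy entries, sentence template, and fusion-by-averaging choice are illustrative assumptions.

```python
import hashlib
import numpy as np

# Toy class hierarchy (invented); the paper retrieves super-/sub-categories
# from ground-truth or LLM-generated hierarchies.
HIERARCHY = {"retriever": {"super": ["dog", "animal"], "sub": ["golden retriever"]}}

def embed(sentence: str) -> np.ndarray:
    """Stand-in text encoder: a deterministic pseudo-embedding seeded by a hash."""
    seed = int(hashlib.md5(sentence.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).standard_normal(8)
    return v / np.linalg.norm(v)

def nexus_vector(target: str) -> np.ndarray:
    # i) retrieve relevant super-/sub-categories from the hierarchy
    node = HIERARCHY[target]
    related = node["super"] + node["sub"]
    # ii) integrate them into hierarchy-aware sentences
    sentences = [f"a photo of a {target}, which is a kind of {s}" for s in related]
    # iii) fuse the sentence embeddings into a single nexus classifier vector
    fused = np.mean([embed(s) for s in sentences], axis=0)
    return fused / np.linalg.norm(fused)

w = nexus_vector("retriever")  # unit-norm classifier vector for "retriever"
```

Because all three steps run offline, the resulting vector can replace the plain class-name embedding in any off-the-shelf OvOD detector without inference-time overhead, as the abstract notes.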
Classifying public tenders is a useful task both for companies that are invited to participate and for inspecting fraudulent activities. To facilitate the task for participants and public administrations alike, the European Union presented a common taxonomy (Common Procurement Vocabulary, CPV), which is mandatory for tenders of a certain importance; however, the contracts in which a CPV label is mandatory are a minority compared to all Public Administration activities. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. First of all, some fine-grained classes have an insufficient (if any) number of observations in the training set, while other classes are far more frequent (even thousands of times) than the average. To overcome these difficulties, we present a zero-shot approach based on a pre-trained language model that relies only on label descriptions and respects the label taxonomy. To train our proposed model, we used industrial data from this http URL, a service by SpazioDati s.r.l. that collects public contracts stipulated in Italy over the last 25 years. Results show that the proposed model achieves better performance than three different baselines in classifying low-frequency classes, and is also able to predict never-seen classes.
https://arxiv.org/abs/2405.09983
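A toy sketch of the zero-shot idea described above: each class is represented only by its textual description, so even a never-seen label can be scored at inference time by embedding similarity. The hashed bag-of-words encoder and the CPV-style descriptions are crude illustrative stand-ins for the pre-trained language model and the real taxonomy.

```python
import hashlib
import numpy as np

D = 256  # embedding dimensionality for the hashed bag-of-words stand-in

def embed(text: str) -> np.ndarray:
    """Hashed bag-of-words: a crude stand-in for a pre-trained text encoder."""
    v = np.zeros(D)
    for tok in text.lower().split():
        v[int(hashlib.md5(tok.encode()).hexdigest(), 16) % D] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# CPV-style codes with invented textual descriptions.
LABELS = {
    "45000000": "construction work for buildings and civil engineering",
    "48000000": "software packages and information systems",
    "33600000": "pharmaceutical products and medicines",
}

def classify(tender: str) -> str:
    """Zero-shot: pick the label whose description embedding is most similar."""
    q = embed(tender)
    scores = {code: float(q @ embed(desc)) for code, desc in LABELS.items()}
    return max(scores, key=scores.get)

pred = classify("supply of software packages for hospital information systems")
```

Adding a new class only requires writing its description; no retraining is needed, which is what lets the approach predict never-seen classes.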
Most existing attention prediction research focuses on salient instances like humans and objects. However, the more complex interaction-oriented attention, arising from the comprehension of interactions between instances by human observers, remains largely unexplored. This is equally crucial for advancing human-machine interaction and human-centered artificial intelligence. To bridge this gap, we first collect a novel gaze fixation dataset named IG, comprising 530,000 fixation points across 740 diverse interaction categories, capturing visual attention during human observers' cognitive processing of interactions. Subsequently, we introduce the zero-shot interaction-oriented attention prediction task ZeroIA, which challenges models to predict visual cues for interactions not encountered during training. Thirdly, we present the Interactive Attention model IA, designed to emulate human observers' cognitive processes to tackle the ZeroIA problem. Extensive experiments demonstrate that the proposed IA outperforms other state-of-the-art approaches in both ZeroIA and fully supervised settings. Lastly, we endeavor to apply interaction-oriented attention to the interaction recognition task itself. Further experimental results demonstrate the promising potential to enhance the performance and interpretability of existing state-of-the-art HOI models by incorporating real human attention data from IG and attention labels generated by IA.
https://arxiv.org/abs/2405.09931
Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure ICL data efficiency of the models, or the rate at which the models learn from more demonstrating examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at this https URL .
https://arxiv.org/abs/2405.09798
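The query-batching idea described above can be sketched as follows: several queries share one prompt (and thus one copy of the long many-shot demonstration prefix), and the response is parsed back into per-query answers. The "Answer N:" convention and all texts are illustrative assumptions, not the paper's exact protocol.

```python
# Pack several queries into one prompt so a single API call amortizes the long
# many-shot demonstration prefix; parse numbered answers back out.

def build_batched_prompt(demos, queries):
    """Shared demonstration prefix followed by a numbered batch of queries."""
    parts = [f"Image description: {t}\nLabel: {l}" for t, l in demos]
    parts.append("Answer each numbered query with a line 'Answer N: <label>'.")
    parts += [f"Query {i}: {q}" for i, q in enumerate(queries, start=1)]
    return "\n\n".join(parts)

def parse_batched_answers(response: str, n: int):
    """Recover one label per query from the model's numbered answer lines."""
    answers = {}
    for line in response.splitlines():
        if line.startswith("Answer "):
            head, _, label = line.partition(":")
            answers[int(head.split()[1])] = label.strip()
    return [answers.get(i) for i in range(1, n + 1)]

demos = [("a chest X-ray with clear lung fields", "normal")]
queries = ["an X-ray showing right-lobe consolidation", "a clear chest X-ray"]
prompt = build_batched_prompt(demos, queries)

# A (mocked) model response in the requested format:
labels = parse_batched_answers("Answer 1: pneumonia\nAnswer 2: normal", n=2)
```

With 50 queries per call, the demonstration prefix is paid for once rather than 50 times, which is the per-query cost and latency reduction the abstract reports.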
This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, examining their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals yet more nuance as well as indicating potential problems with the gold explanations.
https://arxiv.org/abs/2405.09454
Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
https://arxiv.org/abs/2405.09279
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Autonomous systems often encounter environments and scenarios beyond the scope of their training data, which underscores a critical challenge: the need to generalize and adapt to unseen scenarios in real time. This challenge necessitates new mathematical and algorithmic tools that enable adaptation and zero-shot transfer. To this end, we leverage the theory of function encoders, which enables zero-shot transfer by combining the flexibility of neural networks with the mathematical principles of Hilbert spaces. Using this theory, we first present a method for learning a space of dynamics spanned by a set of neural ODE basis functions. After training, the proposed approach can rapidly identify dynamics in the learned space using an efficient inner product calculation. Critically, this calculation requires no gradient calculations or retraining during the online phase. This method enables zero-shot transfer for autonomous systems at runtime and opens the door for a new class of adaptable control algorithms. We demonstrate state-of-the-art system modeling accuracy for two MuJoCo robot environments and show that the learned models can be used for more efficient MPC control of a quadrotor.
https://arxiv.org/abs/2405.08954
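The inner-product identification step described above can be illustrated with fixed analytic basis functions standing in for the learned neural ODE bases: a newly observed function is represented by its inner products with the basis, with no gradient calculations or retraining at runtime.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 200)  # sample grid on [0, 1]

def inner(f_vals, g_vals):
    """Composite trapezoid approximation of the L2 inner product on [0, 1]."""
    prod = f_vals * g_vals
    h = xs[1] - xs[0]
    return h * (prod.sum() - 0.5 * (prod[0] + prod[-1]))

# An orthonormal basis on [0, 1] (the paper learns neural ODE bases instead).
basis = [
    np.ones_like(xs),
    np.sqrt(2.0) * np.cos(2.0 * np.pi * xs),
    np.sqrt(2.0) * np.sin(2.0 * np.pi * xs),
]

# "New" function observed at runtime: f(x) = 2 + 3*sqrt(2)*cos(2*pi*x).
f_vals = 2.0 + 3.0 * np.sqrt(2.0) * np.cos(2.0 * np.pi * xs)

# Zero-shot identification: coefficients via inner products, no gradients.
coeffs = np.array([inner(f_vals, g) for g in basis])
f_hat = sum(c * g for c, g in zip(coeffs, basis))  # reconstruction in the span
```

With this basis the recovered coefficients are approximately [2, 3, 0] and the reconstruction matches the observed values; in the paper, the analogous inner-product step identifies a system's dynamics in the learned space at runtime.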
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good-quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains similar performance to Swin-L, pretrained on ImageNet-22k, on the semantic segmentation task while being 6.1× smaller. Moreover, we show that improving caption quality results in 10× data efficiency when finetuning for dense prediction tasks.
https://arxiv.org/abs/2405.08911