Large language models (LLMs) are versatile and can address many tasks, but for computational efficiency, it is often desirable to distill their capabilities into smaller student models. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is "seeded" with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to standard 32-shot prompting and six baseline approaches.
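The retrieval-seeded synthesis loop described above might be sketched as follows. This is a toy illustration, not the paper's code: `retrieve` is a naive keyword-overlap ranker standing in for a real dense retriever, and the f-string rewrite stands in for the LLM refinement call.

```python
# Toy sketch of retrieval-augmented dataset synthesis (hypothetical names;
# SynthesizRR prompts an LLM with each retrieved passage as the "seed").
CORPUS = [
    "The senate passed the budget bill after a lengthy debate.",
    "The striker scored twice in the final minutes of the match.",
    "A new smartphone chip promises faster on-device inference.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by naive keyword overlap with the query."""
    q = set(query.lower().split())
    def score(p: str) -> int:
        return len(q & set(p.lower().replace(".", "").split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def synthesize(label: str, corpus: list[str]) -> list[str]:
    """Seed one synthetic example per retrieved passage for this label."""
    examples = []
    for passage in retrieve(label, corpus):
        # Stand-in for the LLM call: rewrite the passage as a labeled example.
        examples.append(f"[{label}] {passage}")
    return examples

print(synthesize("sports match", CORPUS))
```

Because each example is conditioned on a different retrieved passage, the synthetic set inherits the corpus's variety instead of the LLM's parametric biases.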
https://arxiv.org/abs/2405.10040
Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure ICL data efficiency of the models, or the rate at which the models learn from more demonstrating examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at this https URL .
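The query-batching idea above, packing many queries into one prompt and parsing the numbered answers back out, can be sketched as follows. The model call itself is stubbed; all function names are illustrative, not from the paper.

```python
def batch_prompt(queries: list[str]) -> str:
    """Pack several queries into one numbered prompt (one API call)."""
    lines = [f"{i + 1}. {q}" for i, q in enumerate(queries)]
    return "Answer each question, prefixing each answer with its number:\n" + "\n".join(lines)

def parse_batched(reply: str, n: int) -> list[str]:
    """Recover per-query answers from a numbered reply."""
    answers = [""] * n
    for line in reply.splitlines():
        head, _, body = line.partition(". ")
        if head.isdigit() and 1 <= int(head) <= n:
            answers[int(head) - 1] = body.strip()
    return answers

queries = ["Is this image a cat?", "Is this image a dog?"]
prompt = batch_prompt(queries)
# A real call would be reply = model.generate(prompt); stubbed here.
reply = "1. yes\n2. no"
print(parse_batched(reply, len(queries)))  # ['yes', 'no']
```

The long shared context (the many-shot demonstrations) is paid for once per batch rather than once per query, which is where the per-query cost and latency savings come from.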
https://arxiv.org/abs/2405.09798
This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, examining their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals yet more nuance as well as indicating potential problems with the gold explanations.
https://arxiv.org/abs/2405.09454
Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.
https://arxiv.org/abs/2405.09335
Medical image interpretation using deep learning has shown promise but often requires extensive expert-annotated datasets. To reduce this annotation burden, we develop an Image-Graph Contrastive Learning framework that pairs chest X-rays with structured report knowledge graphs automatically extracted from radiology notes. Our approach uniquely encodes the disconnected graph components via a relational graph convolution network and transformer attention. In experiments on the CheXpert dataset, this novel graph encoding strategy enabled the framework to outperform existing methods that use image-text contrastive learning in 1% linear evaluation and few-shot settings, while achieving comparable performance to radiologists. By exploiting unlabeled paired images and text, our framework demonstrates the potential of structured clinical insights to enhance contrastive learning for medical images. This work points toward reducing demands on medical experts for annotations, improving diagnostic precision, and advancing patient care through robust medical image understanding.
https://arxiv.org/abs/2405.09594
Sophisticated existing Text-to-SQL methods exhibit errors in various proportions, including schema-linking errors (incorrect columns, tables, or extra columns), join errors, nested errors, and group-by errors. Consequently, there is a critical need to filter out unnecessary tables and columns, directing the language model's attention to relevant tables and columns with schema linking, to reduce errors during SQL generation. Previous approaches have involved sorting tables and columns based on their relevance to the question and selecting the top-ranked ones, or directly identifying the necessary tables and columns for SQL generation. However, these methods face challenges such as lengthy model training times, high consumption of expensive GPT-4 tokens in few-shot prompts, or suboptimal performance in schema linking. Therefore, we propose an inventive schema-linking method in two steps: first, generate an initial SQL query by utilizing the complete database schema; subsequently, extract tables and columns from the initial SQL query to create a concise schema. Using CodeLlama-34B, when comparing the schemas obtained by mainstream methods with ours for SQL generation, our schema performs optimally. Leveraging GPT-4, our SQL generation method achieved results that are comparable to mainstream Text-to-SQL methods on the Spider dataset.
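The second step, extracting a concise schema from the initial SQL query, might look like the simplified sketch below. It uses plain regexes and only handles explicitly qualified `table.column` references; a faithful implementation would use a real SQL parser.

```python
import re

def extract_schema(sql: str) -> dict[str, set[str]]:
    """Simplified sketch: pull table names (FROM/JOIN) and qualified
    column references (table.column) out of an initial SQL query."""
    tables = set(re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE))
    columns: dict[str, set[str]] = {t: set() for t in tables}
    for table, column in re.findall(r"\b(\w+)\.(\w+)", sql):
        if table in columns:
            columns[table].add(column)
    return columns

initial_sql = (
    "SELECT singer.name, concert.year FROM singer "
    "JOIN concert ON singer.id = concert.singer_id"
)
print(extract_schema(initial_sql))
```

The concise schema is then fed back for a second, focused generation pass, which is where the schema-linking gains come from.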
https://arxiv.org/abs/2405.09593
Few-shot segmentation (FSS) aims to train a model which can segment the object from novel classes with a few labeled samples. The insufficient generalization ability of models leads to unsatisfactory performance when the models lack enough labeled data from the novel classes. Considering that there are abundant unlabeled data available, it is promising to improve the generalization ability by exploiting these various data. For leveraging unlabeled data, we propose a novel method, named Image to Pseudo-Episode (IPE), to generate pseudo-episodes from unlabeled data. Specifically, our method contains two modules, i.e., the pseudo-label generation module and the episode generation module. The former module generates pseudo-labels from unlabeled images by the spectral clustering algorithm, and the latter module generates pseudo-episodes from pseudo-labeled images by data augmentation methods. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate that our method achieves the state-of-the-art performance for FSS.
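The pseudo-label generation module's use of spectral clustering can be illustrated with a minimal NumPy implementation, a sketch only: a 2-way spectral split over toy "pixel" features, where the paper operates on real unlabeled images.

```python
import numpy as np

def spectral_pseudo_labels(features: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Minimal 2-way spectral clustering: sign of the Fiedler vector of
    the normalized graph Laplacian over an RBF affinity matrix."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))       # RBF affinities
    np.fill_diagonal(W, 0.0)
    D = W.sum(1)
    L = np.eye(len(W)) - W / np.sqrt(D[:, None] * D[None, :])  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    fiedler = vecs[:, 1]                     # eigenvector of 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Toy "pixels": two well-separated feature blobs.
pixels = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
print(spectral_pseudo_labels(pixels))
```

The resulting binary partition serves as the pseudo-label mask; the episode generation module then augments the pseudo-labeled images into support/query episodes.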
https://arxiv.org/abs/2405.08765
Few-shot segmentation remains challenging due to the limitations of its labeling information for unseen classes. Most previous approaches rely on extracting high-level feature maps from the frozen visual encoder to compute the pixel-wise similarity as a key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes since these high-level feature maps have obvious category bias. In this work, we propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance the model generalization. Specifically, we design two kinds of training-free prior information generation strategies that attempt to utilize the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. Besides, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and utilize it to refine the initial prior information. Experiments on both the PASCAL-$5^i$ and COCO-$20^i$ datasets show that our method obtains a substantial improvement and reaches the new state-of-the-art performance.
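A training-free CLIP-style prior of the kind described above can be sketched as a cosine-similarity map between patch embeddings and the target class's text embedding. The embeddings below are random stand-ins (a real implementation would take them from a pretrained CLIP model), and `training_free_prior` is an illustrative name.

```python
import numpy as np

def training_free_prior(patch_feats: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Sketch of a CLIP-style prior: cosine similarity between each image
    patch embedding and the target-class text embedding, min-max normalized
    into a coarse localization map."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sim = p @ t
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)

rng = np.random.default_rng(0)
text = rng.normal(size=8)                      # stand-in text embedding
patches = rng.normal(size=(4, 8))              # stand-in patch embeddings
patches[2] = text + 0.1 * rng.normal(size=8)   # one patch aligned with the class
prior = training_free_prior(patches, text)
print(prior.argmax())  # the aligned patch scores highest
```

Because no parameters are trained, the prior carries none of the category bias that frozen-encoder feature maps accumulate on seen classes.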
https://arxiv.org/abs/2405.08458
Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) and data analysis systems. Most existing datasets either fail to address the issue of external knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset KET-QA with fine-grained gold evidence annotation. Each table in the dataset corresponds to a sub-graph of the entire KB, and every question requires the integration of information from both the table and the sub-graph to be answered. To extract pertinent information from the vast knowledge sub-graph and apply it to TableQA, we design a retriever-reasoner structured pipeline model. Experimental results demonstrate that our model consistently achieves remarkable relative performance improvements ranging from 1.9 to 6.5 times and absolute improvements of 11.66% to 44.64% on EM scores across three distinct settings (fine-tuning, zero-shot, and few-shot), in comparison with solely relying on table information in the traditional TableQA manner. However, even the best model achieves a 60.23% EM score, which still lags behind the human-level performance, highlighting the challenging nature of KET-QA for the question-answering community. We also provide a human evaluation of error cases to analyze further the aspects in which the model can be improved. Project page: this https URL.
https://arxiv.org/abs/2405.08099
Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.
https://arxiv.org/abs/2405.07745
Detecting anomaly edges in dynamic graphs aims to identify edges significantly deviating from the normal pattern and can be applied in various domains, such as cybersecurity, financial transactions, and AIOps. As time evolves, new types of anomaly edges emerge, and labeled anomaly samples for each type are few. Current methods are either designed to detect randomly inserted edges or require sufficient labeled data for model training, which harms their applicability for real-world applications. In this paper, we study this problem by cooperating with the rich knowledge encoded in large language models (LLMs) and propose a method, namely AnomalyLLM. To align the dynamic graph with LLMs, AnomalyLLM pre-trains a dynamic-aware encoder to generate the representations of edges and reprograms the edges using the prototypes of word embeddings. Along with the encoder, we design an in-context learning framework that integrates the information of a few labeled samples to achieve few-shot anomaly detection. Experiments on four datasets reveal that AnomalyLLM can not only significantly improve the performance of few-shot anomaly detection, but also achieve superior results on new anomalies without any update of model parameters.
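The reprogramming step, expressing an edge representation in the LLM's own embedding space via word-embedding prototypes, might look roughly like this. The sketch is illustrative, not the paper's code: it mixes the top-k nearest prototypes with softmax weights, and all names and shapes are assumptions.

```python
import numpy as np

def reprogram(edge_emb: np.ndarray, prototypes: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Sketch of edge reprogramming: express an edge embedding as a
    softmax-weighted mix of its nearest word-embedding prototypes, so the
    LLM receives inputs that live in its own token space."""
    scores = prototypes @ edge_emb           # similarity to each prototype
    top = np.argsort(scores)[-top_k:]        # indices of the top-k prototypes
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over the top-k scores
    return w @ prototypes[top]

rng = np.random.default_rng(1)
protos = rng.normal(size=(10, 4))            # stand-in word-embedding prototypes
edge = protos[3] + 0.05 * rng.normal(size=4) # an edge near prototype 3
print(np.round(reprogram(edge, protos), 2))
```

Keeping the mapping in the LLM's embedding space is what lets the frozen model consume graph-derived inputs without any parameter updates.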
https://arxiv.org/abs/2405.07626
High-quality images are crucial in remote sensing and UAV applications, but atmospheric haze can severely degrade image quality, making image dehazing a critical research area. Since the introduction of deep convolutional neural networks, numerous approaches have been proposed, and even more have emerged with the development of vision transformers and contrastive/few-shot learning. Simultaneously, papers describing dehazing architectures applicable to various Remote Sensing (RS) domains are also being published. This review goes beyond the traditional focus on benchmarked haze datasets, as we also explore the application of dehazing techniques to remote sensing and UAV datasets, providing a comprehensive overview of both deep learning and prior-based approaches in these domains. We identify key challenges, including the lack of large-scale RS datasets and the need for more robust evaluation metrics, and outline potential solutions and future research directions to address them. This review is the first, to our knowledge, to provide comprehensive discussions on both existing and very recent dehazing approaches (as of 2024) on benchmarked and RS datasets, including UAV-based imagery.
https://arxiv.org/abs/2405.07520
In recent years, deep learning based on Convolutional Neural Networks (CNNs) has achieved remarkable success in many applications. However, their heavy reliance on extensive labeled data and limited generalization ability to unseen classes pose challenges to their suitability for medical image processing tasks. Few-shot learning, which utilizes a small amount of labeled data to generalize to unseen classes, has emerged as a critical research area, attracting substantial attention. Currently, most studies employ a prototype-based approach, in which prototypical networks are used to construct prototypes from the support set, guiding the processing of the query set to obtain the final results. While effective, this approach heavily relies on the support set while neglecting the query set, resulting in notable disparities within the model classes. To mitigate this drawback, we propose a novel Support-Query Prototype Fusion Network (SQPFNet). SQPFNet initially generates several support prototypes for the foreground areas of the support images, thus producing a coarse segmentation mask. Subsequently, a query prototype is constructed based on the coarse segmentation mask, additionally exploiting pattern information in the query set. Thus, SQPFNet constructs high-quality support-query fused prototypes, upon which the query image is segmented to obtain the final refined query mask. Evaluation results on two public datasets, SABS and CMR, show that SQPFNet achieves state-of-the-art performance.
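The prototype machinery underlying this family of methods can be sketched in a few lines: masked average pooling builds a support prototype, and cosine similarity of query features to that prototype yields the coarse mask (the toy 2x2 "images" and 3-dim features below are illustrative; SQPFNet additionally fuses a query prototype built from this coarse mask).

```python
import numpy as np

def masked_avg_prototype(feats: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked average pooling: mean feature over foreground pixels."""
    m = mask.astype(float).reshape(-1, 1)
    f = feats.reshape(-1, feats.shape[-1])
    return (f * m).sum(0) / (m.sum() + 1e-8)

def coarse_mask(query_feats: np.ndarray, prototype: np.ndarray) -> np.ndarray:
    """Cosine similarity of each query pixel to the prototype, thresholded."""
    q = query_feats / (np.linalg.norm(query_feats, axis=-1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    return (q @ p > 0.5).astype(int)

# Toy 2x2 "images" with 3-dim per-pixel features: foreground ~ [1, 0, 0].
support = np.array([[[1, 0, 0], [0.9, 0.1, 0]], [[0, 1, 0], [0, 0.9, 0.1]]])
support_mask = np.array([[1, 1], [0, 0]])
proto = masked_avg_prototype(support, support_mask)
query = np.array([[[1, 0.1, 0], [0, 1, 0]], [[0.95, 0, 0], [0, 0, 1]]])
print(coarse_mask(query, proto))  # [[1, 0], [1, 0]]
```

SQPFNet's contribution is the next step: using this coarse mask to pool a *query* prototype and fusing it with the support prototypes before the final segmentation.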
https://arxiv.org/abs/2405.07516
Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated with a simple natural-language explanation. We test whether models that have learned to classify inputs competently (both in- and out-of-distribution) are able to articulate freeform natural language explanations that match their classification behavior. Our dataset can be used for both in-context and finetuning evaluations. We evaluate a range of LLMs, demonstrating that articulation accuracy varies considerably between models, with a particularly sharp increase from GPT-3 to GPT-4. We then investigate whether we can improve GPT-3's articulation accuracy through a range of methods. GPT-3 completely fails to articulate 7/10 rules in our test, even after additional finetuning on correct explanations. We release our dataset, ArticulateRules, which can be used to test self-explanation for LLMs trained either in-context or by finetuning.
https://arxiv.org/abs/2405.07436
We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various Large Language Models. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of nearly 27%-37% (27% for zero-shot learning and 37% for few-shot learning) when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our benchmark is available at this https URL
https://arxiv.org/abs/2405.07348
Generating images from text has become easier because of the scaling of diffusion models and advancements in the field of vision and language. These models are trained using vast amounts of data from the Internet. Hence, they often contain undesirable content such as copyrighted material. As it is challenging to remove such data and retrain the models, methods for erasing specific concepts from pre-trained models have been investigated. We propose a novel concept-erasure method that updates the text encoder using few-shot unlearning in which a few real images are used. The discussion regarding the generated images after erasing a concept has been lacking. While there are methods for specifying the transition destination for concepts, the validity of the specified concepts is unclear. Our method implicitly achieves this by transitioning to the latent concepts inherent in the model or the images. Our method can erase a concept within 10 s, making concept erasure more accessible than ever before. Implicitly transitioning to related concepts leads to more natural concept erasure. We applied the proposed method to various concepts and confirmed that concept erasure can be achieved tens to hundreds of times faster than with current methods. By varying the parameters to be updated, we obtained results suggesting that, like previous research, knowledge is primarily accumulated in the feed-forward networks of the text encoder.
https://arxiv.org/abs/2405.07288
Semitic morphologically-rich languages (MRLs) are characterized by extreme word ambiguity. Because most vowels are omitted in standard texts, many of the words are homographs with multiple possible analyses, each with a different pronunciation and different morphosyntactic properties. This ambiguity goes beyond word-sense disambiguation (WSD), and may include token segmentation into multiple word units. Previous research on MRLs claimed that standardly trained pre-trained language models (PLMs) based on word-pieces may not sufficiently capture the internal structure of such tokens in order to distinguish between these analyses. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated and analyzed using PLMs. We evaluate all existing models for contextualized Hebrew embeddings on novel Hebrew homograph challenge sets that we deliver. Our empirical results demonstrate that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings; and that they are most effective for disambiguating segmentation and morphosyntactic features, less so regarding pure word-sense disambiguation. We show that these embeddings are more effective when the number of word-piece splits is limited, and they are more effective for 2-way and 3-way ambiguities than for 4-way ambiguity. We show that the embeddings are equally effective for homographs of both balanced and skewed distributions, whether calculated as masked or unmasked tokens. Finally, we show that these embeddings are as effective for homograph disambiguation with extensive supervised training as with a few-shot setup.
https://arxiv.org/abs/2405.07099
Garment manipulation (e.g., unfolding, folding and hanging clothes) is essential for future robots to accomplish home-assistant tasks, while highly challenging due to the diversity of garment configurations, geometries and deformations. Although able to manipulate similarly shaped garments in a certain task, previous works mostly have to design different policies for different tasks, cannot generalize to garments with diverse geometries, and often rely heavily on human-annotated data. In this paper, we leverage the property that garments in a certain category have similar structures, and then learn the topological dense (point-level) visual correspondence among garments in the category level with different deformations in a self-supervised manner. The topological correspondence can be easily adapted to the functional correspondence to guide the manipulation policies for various downstream tasks, with only one- or few-shot demonstrations. Experiments over garments in 3 different categories on 3 representative tasks in diverse scenarios, using one or two arms, taking one or more steps, inputting flat or messy garments, demonstrate the effectiveness of our proposed method. Project page: this https URL.
https://arxiv.org/abs/2405.06903
In today's digital landscape, where cyber attacks have become the norm, the detection of cyber attacks and threats is critically imperative across diverse domains. Our research presents a new empirical framework for cyber threat modeling, adept at parsing and categorizing cyber-related information from news articles, enhancing real-time vigilance for market stakeholders. At the core of this framework is a fine-tuned BERT model, which we call CANAL - Cyber Activity News Alerting Language Model, tailored for cyber categorization using a novel silver labeling approach powered by Random Forest. We benchmark CANAL against larger, costlier LLMs, including GPT-4, LLaMA, and Zephyr, highlighting their zero to few-shot learning in cyber news classification. CANAL demonstrates superior performance by outperforming all other LLM counterparts in both accuracy and cost-effectiveness. Furthermore, we introduce the Cyber Signal Discovery module, a strategic component designed to efficiently detect emerging cyber signals from news articles. Collectively, CANAL and Cyber Signal Discovery module equip our framework to provide a robust and cost-effective solution for businesses that require agile responses to cyber intelligence.
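The silver-labeling step might look like the sketch below, with a small keyword-rule ensemble standing in for the paper's Random Forest labeler; the rules, labels, and headlines are all invented for illustration.

```python
# Sketch of silver labeling for cyber-news classification: weak keyword
# rules vote on a noisy ("silver") label that can then supervise a model.
# CANAL uses a Random Forest here; a plain rule ensemble stands in below.
RULES = [
    lambda t: "cyber" if any(w in t for w in ("breach", "ransomware")) else None,
    lambda t: "cyber" if "phishing" in t else None,
    lambda t: "benign" if "earnings" in t else None,
]

def silver_label(text: str) -> str:
    votes = [r(text.lower()) for r in RULES]
    votes = [v for v in votes if v is not None]
    # Majority vote over the rules that fired; abstain when none fires.
    return max(set(votes), key=votes.count) if votes else "abstain"

headlines = [
    "Ransomware breach hits hospital network",
    "Quarterly earnings beat analyst estimates",
    "New phishing campaign targets banks",
]
print([silver_label(h) for h in headlines])  # ['cyber', 'benign', 'cyber']
```

The appeal of silver labels is cost: they bootstrap training data for the fine-tuned BERT model without per-article human annotation.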
https://arxiv.org/abs/2405.06772
Large Language Models (LLMs) play a crucial role in capturing structured semantics to enhance language understanding, improve interpretability, and reduce bias. Nevertheless, an ongoing controversy exists over the extent to which LLMs can grasp structured semantics. To assess this, we propose using Semantic Role Labeling (SRL) as a fundamental task to explore LLMs' ability to extract structured semantics. In our assessment, we employ the prompting approach, which leads to the creation of our few-shot SRL parser, called PromptSRL. PromptSRL enables LLMs to map natural languages to explicit semantic structures, which provides an interpretable window into the properties of LLMs. We find interesting potential: LLMs can indeed capture semantic structures, and scaling-up doesn't always mirror potential. Additionally, limitations of LLMs are observed in C-arguments, etc. Lastly, we are surprised to discover that significant overlap in the errors is made by both LLMs and untrained humans, accounting for almost 30% of all errors.
https://arxiv.org/abs/2405.06410