Large Language Models (LLMs) are deep learning models designed to generate text based on textual input. Although researchers have been developing these models for more complex tasks such as code generation and general reasoning, few efforts have explored how LLMs can be applied to combinatorial problems. In this research, we investigate the potential of LLMs to solve the Travelling Salesman Problem (TSP). Using GPT-3.5 Turbo, we conducted experiments employing various approaches, including zero-shot in-context learning, few-shot in-context learning, and chain-of-thought (CoT) prompting. Subsequently, we fine-tuned GPT-3.5 Turbo to solve a specific problem size and tested it on a set of instances of various sizes. The fine-tuned models demonstrated promising performance on problems identical in size to the training instances and generalized well to larger problems. Furthermore, to improve the performance of the fine-tuned model without incurring additional training costs, we adopted a self-ensemble approach to improve the quality of the solutions.
https://arxiv.org/abs/2405.01997
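The self-ensemble step described above is model-agnostic and can be sketched without any LLM in the loop: sample several candidate tours, discard invalid ones, and keep the shortest. A minimal illustration (function names are ours, not the paper's):

```python
import math

def tour_length(tour, coords):
    """Total length of a closed tour over 2D city coordinates."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def self_ensemble(candidate_tours, coords):
    """Self-ensemble: among several sampled solutions, drop invalid
    permutations and keep the tour with the smallest length."""
    n = len(coords)
    valid = [t for t in candidate_tours if sorted(t) == list(range(n))]
    return min(valid, key=lambda t: tour_length(t, coords))
```

On a unit square, for example, the ensemble over a crossing tour and the perimeter tour returns the perimeter tour of length 4.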
Time series data are ubiquitous across various domains, making time series analysis critically important. Traditional time series models are task-specific, featuring singular functionality and limited generalization capacity. Recently, large language foundation models have unveiled their remarkable capabilities for cross-task transferability, zero-shot/few-shot learning, and decision-making explainability. This success has sparked interest in the exploration of foundation models to solve multiple time series challenges simultaneously. There are two main research lines, namely \textbf{pre-training foundation models from scratch for time series} and \textbf{adapting large language foundation models for time series}. Both contribute to the development of a unified model that is highly generalizable, versatile, and comprehensible for time series analysis. This survey offers a 3E analytical framework for comprehensive examination of related research. Specifically, we examine existing works from three dimensions, namely \textbf{Effectiveness}, \textbf{Efficiency}, and \textbf{Explainability}. In each dimension, we focus on discussing how related works devise tailored solutions by considering the unique challenges in the realm of time series. Furthermore, we provide a domain taxonomy to help followers keep up with domain-specific advancements. In addition, we introduce extensive resources to facilitate the field's development, including datasets, open-source code, and time series libraries. A GitHub repository is also maintained for resource updates (this https URL).
https://arxiv.org/abs/2405.02358
Generalization to unseen data remains poorly understood for deep learning classification and foundation models. How can one assess the ability of networks to adapt to new or extended versions of their input space in the spirit of few-shot learning, out-of-distribution generalization, and domain adaptation? Which layers of a network are likely to generalize best? We provide a new method for evaluating the capacity of networks to represent a sampled domain, regardless of whether the network has been trained on all classes in the domain. Our approach is the following: after fine-tuning state-of-the-art pre-trained models for visual classification on a particular domain, we assess their performance on data from related but distinct variations in that domain. Generalization power is quantified as a function of the latent embeddings of unseen data from intermediate layers for both unsupervised and supervised settings. Working throughout all stages of the network, we find that (i) high classification accuracy does not imply high generalizability; and (ii) deeper layers in a model do not always generalize the best, which has implications for pruning. Since the trends observed across datasets are largely consistent, we conclude that our approach reveals (a function of) the intrinsic capacity of the different layers of a model to generalize.
https://arxiv.org/abs/2405.01524
Recent advancements in deep learning have demonstrated remarkable performance comparable to human capabilities across various supervised computer vision tasks. However, the prevalent assumption of having an extensive pool of training data encompassing all classes prior to model training often diverges from real-world scenarios, where limited data availability for novel classes is the norm. The challenge lies in seamlessly integrating new classes with few samples into the training data, demanding that the model accommodate these additions without compromising its performance on base classes. To address this exigency, the research community has introduced several solutions under the realm of few-shot class incremental learning (FSCIL). In this study, we introduce an innovative FSCIL framework that utilizes a language regularizer and a subspace regularizer. During base training, the language regularizer helps incorporate semantic information extracted from a Vision-Language model. During incremental training, the subspace regularizer helps the model acquire nuanced connections between the image and text semantics inherent to base classes. Our proposed framework not only empowers the model to embrace novel classes with limited data, but also ensures the preservation of performance on base classes. To substantiate the efficacy of our approach, we conduct comprehensive experiments on three distinct FSCIL benchmarks, where our framework attains state-of-the-art performance.
https://arxiv.org/abs/2405.01040
Learning to solve vehicle routing problems (VRPs) has garnered much attention. However, most neural solvers are only structured and trained independently on a specific problem, making them less generic and practical. In this paper, we aim to develop a unified neural solver that can cope with a range of VRP variants simultaneously. Specifically, we propose a multi-task vehicle routing solver with mixture-of-experts (MVMoE), which greatly enhances the model capacity without a proportional increase in computation. We further develop a hierarchical gating mechanism for the MVMoE, delivering a good trade-off between empirical performance and computational complexity. Experimentally, our method significantly promotes the zero-shot generalization performance on 10 unseen VRP variants, and showcases decent results on the few-shot setting and real-world benchmark instances. We further provide extensive studies on the effect of MoE configurations in solving VRPs. Surprisingly, the hierarchical gating can achieve much better out-of-distribution generalization performance. The source code is available at: this https URL.
https://arxiv.org/abs/2405.01029
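The hierarchical gating idea can be pictured as a two-level router: a first gate scores expert groups and routes the input to the top group, and a second gate mixes only that group's experts, so most experts are never scored. A toy NumPy sketch (all names, shapes, and the argmax routing rule are our simplifying assumptions, not MVMoE's actual architecture):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hierarchical_gate(x, w_group, w_expert, experts):
    """Two-level gating: pick the top-scoring expert group, then mix
    only that group's experts with a local softmax gate."""
    gid = int(np.argmax(softmax(w_group @ x)))   # sparse group choice
    weights = softmax(w_expert[gid] @ x)         # within-group gate
    return sum(w * f(x) for w, f in zip(weights, experts[gid]))
```

Because only the selected group's experts run, capacity grows with the number of experts while per-input computation stays roughly constant.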
Although pre-trained language models (PLMs) have exhibited great flexibility and versatility with prompt-based few-shot learning, they suffer from extensive parameter sizes and limited applicability for inference. Recent studies have suggested that PLMs be used as dataset generators and that a tiny task-specific model be trained to achieve efficient inference. However, their applicability to various domains is limited because they tend to generate domain-specific datasets. In this work, we propose a novel approach to universal domain generalization that generates a dataset regardless of the target domain. This allows the tiny task model to generalize to any domain that shares the label space, thus enhancing the real-world applicability of the dataset generation paradigm. Our experiments indicate that the proposed method achieves generalizability across various domains while using a parameter set that is orders of magnitude smaller than that of PLMs.
https://arxiv.org/abs/2405.01022
One-on-one tutoring is widely acknowledged as an effective instructional method, conditioned on qualified tutors. However, the high demand for qualified tutors remains a challenge, often necessitating the training of novice tutors (i.e., trainees) to ensure effective tutoring. Research suggests that providing timely explanatory feedback can facilitate the training process for trainees. However, it presents challenges due to the time-consuming nature of assessing trainee performance by human experts. Inspired by the recent advancements of large language models (LLMs), our study employed the GPT-4 model to build an explanatory feedback system. This system identifies trainees' responses in binary form (i.e., correct/incorrect) and automatically provides template-based feedback with responses appropriately rephrased by the GPT-4 model. We conducted our study on 410 responses from trainees across three training lessons: Giving Effective Praise, Reacting to Errors, and Determining What Students Know. Our findings indicate that: 1) using a few-shot approach, the GPT-4 model effectively identifies correct/incorrect trainees' responses from three training lessons with an average F1 score of 0.84 and an AUC score of 0.85; and 2) using the few-shot approach, the GPT-4 model adeptly rephrases incorrect trainees' responses into desired responses, achieving performance comparable to that of human experts.
https://arxiv.org/abs/2405.00970
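The few-shot identification step above reduces to assembling labeled example responses into a prompt that the model completes with a label. A minimal sketch of such prompt construction (the template wording is ours, not the study's actual prompt, and no API call is made here):

```python
def build_fewshot_prompt(examples, new_response):
    """Assemble a few-shot classification prompt: each example pairs a
    trainee response with a Correct/Incorrect label, and the new
    response is appended for the model to label."""
    lines = ["Label each trainee response as Correct or Incorrect."]
    for response, label in examples:
        lines.append(f"Response: {response}\nLabel: {label}")
    lines.append(f"Response: {new_response}\nLabel:")
    return "\n\n".join(lines)
```

The resulting string ends with a bare `Label:` so the model's next token is the prediction; the same scaffold extends naturally to the rephrasing task by swapping the instruction and example pairs.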
There has been growing interest in audio-language retrieval research, where the objective is to establish the correlation between audio and text modalities. However, most audio-text paired datasets lack rich expression in the text data compared to the audio samples. One of the significant challenges facing audio-text datasets is the presence of similar or identical captions despite different audio samples. Under such many-to-one mapping conditions, audio-text datasets lead to poor performance on retrieval tasks. In this paper, we propose a novel approach to tackle the data imbalance problem in the audio-language retrieval task. To overcome this limitation, we introduce a method that employs a distance-sampling-based paraphraser leveraging ChatGPT, utilizing a distance function to generate a controllable distribution of manipulated text data. For a set of sentences with the same context, the distance is used to calculate the degree of manipulation between any two sentences, and ChatGPT's few-shot prompting is performed using a text cluster of similar distances defined by the Jaccard similarity. Therefore, ChatGPT, when applied to few-shot prompting with text clusters, can adjust the diversity of the manipulated text based on the distance. The proposed approach is shown to significantly enhance performance in audio-text retrieval, outperforming conventional text augmentation techniques.
https://arxiv.org/abs/2405.00367
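The distance driving the clustering above is one minus the Jaccard similarity over token sets. A minimal sketch (illustrative only; the paper's exact tokenization and distance banding may differ):

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between the token sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_by_distance(anchor, sentences, low, high):
    """Keep sentences whose distance (1 - Jaccard) to the anchor falls
    inside a target band, forming a similar-distance text cluster."""
    return [s for s in sentences
            if low <= 1.0 - jaccard_similarity(anchor, s) <= high]
```

Feeding such a band-limited cluster into few-shot prompting is what lets the degree of paraphrase manipulation be controlled by the chosen distance range.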
This paper investigates the effectiveness of self-supervised pre-trained transformers compared to supervised pre-trained transformers and conventional convolutional neural networks (ConvNets) for detecting various types of deepfakes. We focus on their potential for improved generalization, particularly when training data is limited. Despite the notable success of large vision-language models utilizing transformer architectures in various tasks, including zero-shot and few-shot learning, the deepfake detection community has still shown some reluctance to adopt pre-trained vision transformers (ViTs), especially large ones, as feature extractors. One concern is their perceived excessive capacity, which often demands extensive data, and the resulting suboptimal generalization when training or fine-tuning data is small or less diverse. This contrasts with ConvNets, which have already established themselves as robust feature extractors. Additionally, training and optimizing transformers from scratch requires significant computational resources, making this accessible primarily to large companies and hindering broader investigation within the academic community. Recent advancements in using self-supervised learning (SSL) in transformers, such as DINO and its derivatives, have showcased significant adaptability across diverse vision tasks and possess explicit semantic segmentation capabilities. By leveraging DINO for deepfake detection with modest training data and implementing partial fine-tuning, we observe comparable adaptability to the task and natural explainability of the detection result via the attention mechanism. Moreover, partial fine-tuning of transformers for deepfake detection offers a more resource-efficient alternative, requiring significantly fewer computational resources.
https://arxiv.org/abs/2405.00355
Out-of-distribution (OOD) problems in few-shot classification (FSC) occur when novel classes sampled from testing distributions differ from base classes drawn from training distributions, which considerably degrades the performance of deep learning models deployed in real-world applications. Recent studies suggest that OOD problems in FSC mainly include: (a) cross-domain few-shot classification (CD-FSC) and (b) spurious-correlation few-shot classification (SC-FSC). Specifically, CD-FSC occurs when a classifier learns to transfer knowledge from base classes drawn from seen training distributions but must recognize novel classes sampled from unseen testing distributions. In contrast, SC-FSC arises when a classifier relies on non-causal features (or contexts) that happen to be correlated with the labels (or concepts) in base classes, but such relationships no longer hold during model deployment. Although CD-FSC has been extensively studied, SC-FSC remains understudied due to the lack of corresponding evaluation benchmarks. To this end, we present Meta Concept Context (MetaCoCo), a benchmark with spurious-correlation shifts collected from real-world scenarios. Moreover, to quantify the extent of the spurious-correlation shifts in MetaCoCo, we further propose a metric using CLIP as a pre-trained vision-language model. Extensive experiments on the proposed benchmark are performed to evaluate state-of-the-art methods in FSC, cross-domain shifts, and self-supervised learning. The experimental results show that the performance of existing methods degrades significantly in the presence of spurious-correlation shifts. We open-source all code for our benchmark and hope that the proposed MetaCoCo can facilitate future research on spurious-correlation shift problems in FSC. The code is available at: this https URL.
https://arxiv.org/abs/2404.19644
Instance perception tasks (object detection, instance segmentation, pose estimation, counting) play a key role in industrial applications of visual models. As supervised learning methods suffer from high labeling cost, few-shot learning methods which effectively learn from a limited number of labeled examples are desired. Existing few-shot learning methods primarily focus on a restricted set of tasks, presumably due to the challenges involved in designing a generic model capable of representing diverse tasks in a unified manner. In this paper, we propose UniFS, a universal few-shot instance perception model that unifies a wide range of instance perception tasks by reformulating them into a dynamic point representation learning framework. Additionally, we propose Structure-Aware Point Learning (SAPL) to exploit the higher-order structural relationship among points to further enhance representation learning. Our approach makes minimal assumptions about the tasks, yet it achieves competitive results compared to highly specialized and well optimized specialist models. Codes will be released soon.
https://arxiv.org/abs/2404.19401
Large language models have shown their ability to become effective few-shot learners with prompting, revolutionizing the paradigm of learning with data scarcity. However, this approach largely depends on the quality of prompt initialization and always exhibits large variability among different runs. This property makes prompt tuning highly unreliable and vulnerable to poorly constructed prompts, which limits its extension to more real-world applications. To tackle this issue, we propose to treat the hard prompt and soft prompt as separate inputs to mitigate noise brought by prompt initialization. Furthermore, we optimize soft prompts with contrastive learning to utilize class-aware information in the training process and maintain model performance. Experimental results demonstrate that \sysname outperforms state-of-the-art methods by 7.20% in accuracy and reduces the standard deviation by 2.02 on average. Furthermore, extensive experiments underscore its robustness and stability across 7 datasets covering various tasks.
https://arxiv.org/abs/2404.19335
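Class-aware contrastive optimization of soft prompts typically reduces to an InfoNCE-style objective over prompt or example embeddings: pull the anchor toward a same-class positive and away from other-class negatives. A toy NumPy version (a generic sketch of the loss family, not the paper's exact formulation; the temperature value is an assumption):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss over embedding vectors: low when
    the anchor is most similar to its positive, high otherwise."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))        # positive sits at index 0
```

Minimizing this loss over batches injects class membership into the soft-prompt embeddings while the hard prompt stays fixed as a separate input.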
Large vision-language models have impressively promoted the performance of 2D visual recognition under zero/few-shot scenarios. In this paper, we focus on exploiting a large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition based on multi-view representations. The key challenge for both tasks is to generate a discriminative descriptor of the 3D shape represented by multiple view images under the scenario of either no explicit training (zero-shot 3D shape recognition) or training with a limited number of data (few-shot 3D shape recognition). We analyze that both tasks are relevant and can be considered simultaneously. Specifically, leveraging the descriptor that is effective for zero-shot inference to guide the tuning of the aggregated descriptor under few-shot training can significantly improve few-shot learning efficacy. Hence, we propose the Prompt-Enhanced View Aggregation Network (PEVA-Net) to simultaneously address zero/few-shot 3D shape recognition. Under the zero-shot scenario, we propose to leverage prompts built from candidate categories to enhance the aggregation process of multiple view-associated visual features. The resulting aggregated feature serves for effective zero-shot recognition of the 3D shapes. Under the few-shot scenario, we first exploit a transformer encoder to aggregate the view-associated visual features into a global descriptor. To tune the encoder, together with the main classification loss, we propose a self-distillation scheme via a feature distillation loss, treating the zero-shot descriptor as the guidance signal for the few-shot descriptor. This scheme can significantly enhance few-shot learning efficacy.
https://arxiv.org/abs/2404.19168
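The self-distillation scheme amounts to adding a feature-distillation term that pulls the trainable few-shot descriptor toward the frozen zero-shot descriptor used as guidance. A minimal sketch using a mean-squared distance and a weighting coefficient (both our assumptions; the paper's actual distillation loss and weighting may differ):

```python
import numpy as np

def distill_loss(fewshot_desc, zeroshot_desc, cls_loss, alpha=0.5):
    """Total loss = classification loss + alpha * feature-distillation
    term pulling the few-shot descriptor toward the (frozen)
    zero-shot guidance descriptor."""
    fd = float(np.mean((fewshot_desc - zeroshot_desc) ** 2))
    return cls_loss + alpha * fd
```

The zero-shot descriptor carries no gradient, so only the encoder producing the few-shot descriptor is updated by the distillation term.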
Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners, a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt to improve the performance of the LLM in inferring the correct action and task plans. In this technical report, we extend the capabilities of HELPER, by expanding its memory with a wider array of examples and prompts, and by integrating additional APIs for asking questions. This simple expansion of HELPER into a shared memory enables the agent to work across the domains of executing plans from dialogue, natural language instruction following, active question asking, and commonsense room reorganization. We evaluate the agent on four diverse interactive visual-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task. HELPER-X achieves few-shot, state-of-the-art performance across these benchmarks using a single agent, without requiring in-domain training, and remains competitive with agents that have undergone in-domain training.
https://arxiv.org/abs/2404.19065
Speaker recognition technology is applied in various tasks ranging from personal virtual assistants to secure access systems. However, the robustness of these systems against adversarial attacks, particularly additive perturbations, remains a significant challenge. In this paper, we pioneer the application of robustness certification techniques, originally developed for the image domain, to speaker recognition. We close this gap by transferring and improving randomized smoothing certification techniques against norm-bounded additive perturbations from classification and few-shot learning tasks to speaker recognition. We demonstrate the effectiveness of these methods on the VoxCeleb 1 and 2 datasets for several models. We expect this work to improve voice-biometry robustness, establish a new certification benchmark, and accelerate research on certification methods in the audio domain.
https://arxiv.org/abs/2404.18791
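Randomized smoothing certifies an L2 radius R = sigma * Phi^{-1}(p_A) around an input, where p_A lower-bounds the smoothed classifier's top-class probability under Gaussian noise of scale sigma. A minimal sketch of the certificate computation, following the standard formulation rather than this paper's specific variant:

```python
from statistics import NormalDist

def certified_radius(p_a, sigma):
    """Certified L2 radius of a Gaussian-smoothed classifier:
    R = sigma * inverse-normal-CDF(p_a). No certificate is issued
    unless the top class holds a strict majority (p_a > 0.5)."""
    if p_a <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_a)
```

The radius grows with both the noise scale and the confidence margin, which is why certified accuracy trades off against clean accuracy as sigma increases.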
Presently, the task of few-shot object detection (FSOD) in remote sensing images (RSIs) has become a focal point of attention. Numerous few-shot detectors, particularly those based on two-stage detectors, face challenges when dealing with the multiscale complexities inherent in RSIs. Moreover, these detectors present impractical characteristics in real-world applications, mainly due to their unwieldy model parameters when handling large amounts of data. In contrast, we recognize the advantages of one-stage detectors, including high detection speed and a global receptive field. Consequently, we choose the YOLOv7 one-stage detector as a baseline and subject it to a novel meta-learning training framework. This transformation allows the detector to adeptly address FSOD tasks while capitalizing on its inherent advantage of being lightweight. Additionally, we thoroughly investigate the samples generated by the meta-learning strategy and introduce a novel meta-sampling approach to retain samples produced by our designed meta-detection head. Coupled with our devised meta-cross loss, we deliberately utilize ``negative samples'' that are often overlooked to extract valuable knowledge from them. This approach serves to enhance detection accuracy and efficiently refine the overall meta-learning strategy. To validate the effectiveness of our proposed detector, we conducted performance comparisons with current state-of-the-art detectors using the DIOR and NWPU VHR-10.v2 datasets, yielding satisfactory results.
https://arxiv.org/abs/2404.18426
Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different demands. Besides, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting a vast portion of non-native speakers with diverse accents. Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term as "instant" and "fine-grained" adaptations based on their merits. To alleviate the insufficient generalization performance observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage implications for few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.
https://arxiv.org/abs/2404.18094
Logs record runtime information and are important in modern software development. Log parsing, the first step in many log-based analyses, involves extracting structured information from unstructured log data. Traditional log parsers face challenges in accurately parsing logs due to the diversity of log formats, which directly impacts the performance of downstream log-analysis tasks. In this paper, we explore the potential of using Large Language Models (LLMs) for log parsing and propose LLMParser, an LLM-based log parser built on generative LLMs and few-shot tuning. We leverage four LLMs, Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B, in LLMParser. Our evaluation on 16 open-source systems shows that LLMParser achieves statistically significantly higher parsing accuracy than state-of-the-art parsers (a 96% average parsing accuracy). We further conduct a comprehensive empirical analysis of the effect of training size, model size, and pre-training on log parsing accuracy. We find that smaller LLMs may be more effective than more complex LLMs; for instance, Flan-T5-base achieves results comparable to LLaMA-7B with a shorter inference time. We also find that using LLMs pre-trained on logs from other systems does not always improve parsing accuracy. While using a pre-trained Flan-T5-base shows an improvement in accuracy, a pre-trained LLaMA results in a decrease (by almost 55% in group accuracy). In short, our study provides empirical evidence for using LLMs for log parsing and highlights the limitations of and future research directions for LLM-based log parsers.
Logs are important in modern software development, as they contain runtime information. Log parsing, which involves extracting structured information from unstructured log data, is the first step in many log-based analyses. Due to the diversity of log formats, traditional log parsers face challenges in parsing logs accurately, which directly affects the performance of downstream log-analysis tasks. In this paper, we explore the potential of using Large Language Models (LLMs) for log parsing and propose LLMParser, an LLM-based log parser built on generative LLMs and few-shot tuning. We use four LLMs in LLMParser: Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B. Our evaluation on 16 open-source systems shows that LLMParser achieves statistically significantly higher parsing accuracy than state-of-the-art parsers (a 96% average parsing accuracy). We also conduct a comprehensive empirical analysis of the effect of training size, model size, and pre-training on log parsing accuracy. We find that smaller LLMs can be more effective than larger ones; for instance, Flan-T5-base achieves results comparable to LLaMA-7B with a shorter inference time. We also find that pre-training LLMs on logs from other systems does not always improve parsing accuracy: while pre-trained Flan-T5-base shows an improvement in accuracy, pre-trained LLaMA leads to a decrease (a drop of almost 55% in group accuracy). In short, our study provides empirical evidence for using LLMs for log parsing and highlights the limitations and future research directions of LLM-based log parsers.
https://arxiv.org/abs/2404.18001
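The LLMParser abstract reports both parsing accuracy and group accuracy. As a point of reference, the two metrics as they are commonly defined in log-parsing evaluations can be sketched as follows (the function names and toy templates are ours, not from the paper):

```python
from collections import defaultdict

def parsing_accuracy(pred, truth):
    """Fraction of log messages whose predicted template string
    exactly matches the ground-truth template."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def group_accuracy(pred, truth):
    """A message counts as correct only when the set of messages sharing
    its predicted template equals the set sharing its ground-truth one."""
    by_pred, by_true = defaultdict(set), defaultdict(set)
    for i, (p, t) in enumerate(zip(pred, truth)):
        by_pred[p].add(i)
        by_true[t].add(i)
    hits = sum(by_pred[p] == by_true[t] for p, t in zip(pred, truth))
    return hits / len(truth)

# Toy example: '<*>' marks a variable slot in a template.
truth = ["connect <*>", "connect <*>", "shutdown", "shutdown"]
pred  = ["connect <*>", "connect to <*>", "shutdown", "shutdown"]
print(parsing_accuracy(pred, truth))  # 0.75
print(group_accuracy(pred, truth))    # 0.5
```

Group accuracy is the stricter metric: one mis-templated message also invalidates every message it should have been grouped with, which is why the paper's reported 55% drop concerns group accuracy specifically.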
We introduce a few-shot benchmark consisting of 7 different classification tasks native to the Polish language. We conducted an empirical comparison, with 0 and 16 shots, between fine-tuning, linear probing, SetFit, and in-context learning (ICL) using various pre-trained commercial and open-source models. Our findings reveal that ICL achieves the best performance, with commercial models like GPT-3.5 and GPT-4 attaining the top scores. However, there remains a significant 14-percentage-point gap between our best few-shot learning score and the performance of HerBERT-large fine-tuned on the entire training dataset. Among the techniques, SetFit emerges as the second-best approach, closely followed by linear probing. We observed the worst and most unstable performance with non-linear head fine-tuning. The ICL results indicate that continual pre-training of models like Mistral-7b or Llama-2-13b on Polish corpora is beneficial, as confirmed by the improved performance of Bielik-7b and Trurl-13b, respectively. To further support few-shot learning experiments for Polish, we are releasing handcrafted templates for ICL.
We propose a few-shot benchmark consisting of 7 classification tasks native to the Polish language. Using various pre-trained commercial and open-source models, we conducted a comparison between fine-tuning, linear probing, SetFit, and in-context learning (ICL). Our results show that ICL achieves the best performance, with commercial models such as GPT-3.5 and GPT-4 performing best. However, there remains a significant 14-percentage-point gap between our best few-shot learning score and the performance of HerBERT-large fine-tuned on the entire training dataset. Among the techniques, SetFit emerges as the second-best approach, closely followed by linear probing. We observed that non-linear head fine-tuning performed worst and was the most unstable. The ICL results indicate that continual pre-training of models such as Mistral-7b or Llama-2-13b on Polish corpora is beneficial, as confirmed by the improved performance of Bielik-7b and Trurl-13b, respectively. To further support few-shot learning experiments for Polish, we release handcrafted templates for ICL.
https://arxiv.org/abs/2404.17832
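Few-shot ICL as benchmarked above boils down to serializing k labelled demonstrations plus the query into one prompt. A minimal sketch of that construction (the Polish template wording here is illustrative, not one of the paper's released handcrafted templates):

```python
def build_icl_prompt(demos, query, template="Tekst: {text}\nEtykieta: {label}"):
    """Render k (text, label) demonstrations with the template, then append
    the query with its label slot left open for the model to complete."""
    shots = [template.format(text=t, label=l) for t, l in demos]
    shots.append(template.format(text=query, label="").rstrip())
    return "\n\n".join(shots)

demos = [("Świetny film!", "pozytywny"), ("Strata czasu.", "negatywny")]
prompt = build_icl_prompt(demos, "Całkiem niezły.")
print(prompt.endswith("Etykieta:"))  # True
```

With 0 shots, `demos` is empty and the prompt degenerates to the bare query plus the open label slot, which is exactly the zero-shot ICL condition the benchmark compares against 16 shots.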
Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero- and few-shot learning, recent studies indicate that current LLMs still struggle with zero- and few-shot RE. Previous studies have mainly been dedicated to designing prompt formats and selecting good examples to improve ICL-based RE. Although both factors are vital for ICL, if one could fundamentally boost the ICL capability of LLMs in RE, the zero- and few-shot RE performance via ICL would improve significantly. To this end, we introduce \textsc{Micre} (\textbf{M}eta \textbf{I}n-\textbf{C}ontext learning of LLMs for \textbf{R}elation \textbf{E}xtraction), a new meta-training framework for zero- and few-shot RE in which an LLM is tuned to do ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model learns a new RE task in context more effectively by conditioning on a few training examples, with no parameter updates or task-specific templates at inference time, enabling better zero- and few-shot task generalization. We experiment with \textsc{Micre} on various LLMs of different model scales and 12 public RE datasets, and then evaluate it on unseen RE benchmarks under zero- and few-shot settings. \textsc{Micre} delivers comparable or superior performance compared to a range of baselines, including supervised fine-tuning and typical in-context learning methods. We find that the gains are particularly significant for larger model scales, and that using a diverse set of meta-training RE datasets is key to the improvements. Empirically, we show that \textsc{Micre} can transfer relation semantic knowledge via the relation label name during inference on target RE datasets.
Relation extraction (RE) is an important task of identifying the relationships between entities in text. Although large language models (LLMs) have demonstrated impressive general zero- and few-shot learning ability, recent studies indicate that current LLMs still struggle with zero- and few-shot RE. Previous work has mainly been devoted to designing prompt formats and selecting good examples to improve ICL-based RE. While both factors are vital for ICL, fundamentally improving the ICL capability of LLMs in RE would significantly boost zero- and few-shot RE performance via ICL. Therefore, we introduce \textsc{Micre} (\textbf{M}eta \textbf{I}n-\textbf{C}ontext learning of LLMs for \textbf{R}elation \textbf{E}xtraction), a new meta-training framework for zero- and few-shot RE in which an LLM is tuned to perform ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model learns a new RE task in context by conditioning on a few training examples at inference time, without parameter updates or task-specific templates, achieving better zero- and few-shot task generalization. We experiment with \textsc{Micre} on LLMs of various model scales and 12 public RE datasets, and evaluate it on unseen RE benchmarks under zero- and few-shot settings. Compared with a range of baselines, including supervised fine-tuning and typical in-context learning methods, \textsc{Micre} delivers comparable or superior performance. We find that the gains are especially pronounced for larger model scales, and that using a diverse set of meta-training RE datasets is key to the improvements. Empirically, we find that \textsc{Micre} can transfer relation semantic knowledge via the relation label name during inference on target RE datasets.
https://arxiv.org/abs/2404.17807
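Conceptually, meta-training for ICL tunes the model on many episodes of the same shape: k labelled demonstrations followed by an unlabelled query, with the relation label name spelled out in text. A rough sketch of how one such episode might be serialized (the field wording is our illustration, not \textsc{Micre}'s actual format):

```python
def format_re_episode(demos, query):
    """Serialize k labelled RE examples plus one query into a single
    in-context sequence. Spelling out the relation label name in text is
    what lets the model transfer relation semantics at inference."""
    parts = [
        f"Sentence: {s}\nRelation between '{h}' and '{t}': {r}"
        for s, h, t, r in demos
    ]
    s, h, t = query
    parts.append(f"Sentence: {s}\nRelation between '{h}' and '{t}':")
    return "\n\n".join(parts)

demos = [
    ("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw", "place of birth"),
    ("Turing studied at Princeton.", "Turing", "Princeton", "educated at"),
]
episode = format_re_episode(
    demos, ("Chopin was born in Żelazowa Wola.", "Chopin", "Żelazowa Wola"))
```

During meta-training the model sees such episodes drawn from many different RE datasets, so at inference a new dataset only changes the demonstrations and label names, not the model parameters.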