Tables, figures, and listings (TFLs) are essential tools for summarizing clinical trial data. Creation of TFLs for reporting activities is often a time-consuming task encountered routinely during the execution of clinical trials. This study explored the use of large language models (LLMs) to automate the generation of TFLs through prompt engineering and few-shot transfer learning. Using public clinical trial data in ADaM format, our results demonstrated that LLMs can efficiently generate TFLs with prompt instructions, showcasing their potential in this domain. Furthermore, we developed a conversational agent named the Clinical Trial TFL Generation Agent: an app that matches user queries to predefined prompts, which in turn produce customized programs for generating specific predefined TFLs.
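A minimal sketch of the query-to-prompt matching step (the registry contents, the similarity heuristic, and all names here are illustrative assumptions; the abstract does not specify the agent's internals):

```python
# Minimal sketch of matching a user query to a predefined TFL prompt (hypothetical names throughout).
from difflib import SequenceMatcher

# Hypothetical registry: each predefined TFL maps to a prompt template used for code generation.
PROMPT_REGISTRY = {
    "demographics table": "Write a program that summarizes ADSL demographics by treatment arm.",
    "adverse events table": "Write a program that tabulates treatment-emergent AEs from ADAE by SOC and PT.",
    "km plot": "Write a program that produces a Kaplan-Meier plot from ADTTE by treatment arm.",
}

def match_query(user_query: str) -> str:
    """Return the predefined prompt whose registry key is most similar to the user query."""
    scored = {
        key: SequenceMatcher(None, user_query.lower(), key).ratio()
        for key in PROMPT_REGISTRY
    }
    best_key = max(scored, key=scored.get)
    return PROMPT_REGISTRY[best_key]

if __name__ == "__main__":
    prompt = match_query("Please create the AE summary table")
    print(prompt)  # the matched prompt would then be sent to the LLM to generate the TFL program
```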
https://arxiv.org/abs/2409.12046
As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to combine VLMs for downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks, requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture of soft prompt learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from the hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text features and the hard-prompt-encoded text features. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating evident improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at \url{https://anonymous.4open.science/r/mocoop-6387}
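A minimal PyTorch sketch of the routing-plus-gating idea (the feature dimension, the single-linear router, and the additive similarity gate are illustrative assumptions rather than the paper's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftPrompts(nn.Module):
    """Sketch: route each image feature to soft prompts, gated by similarity to hard-prompt embeddings."""
    def __init__(self, n_prompts=4, prompt_len=16, dim=512, hard_prompt_embeds=None):
        super().__init__()
        # One learnable soft prompt per expert; in the paper these are initialized from grouped templates.
        self.soft_prompts = nn.Parameter(torch.randn(n_prompts, prompt_len, dim) * 0.02)
        self.router = nn.Linear(dim, n_prompts)
        # Frozen text embeddings of the manually designed (hard) prompt templates, one per expert.
        self.register_buffer("hard_embeds", hard_prompt_embeds
                             if hard_prompt_embeds is not None else torch.randn(n_prompts, dim))

    def forward(self, image_feat):                                   # image_feat: (B, dim)
        logits = self.router(image_feat)                             # (B, n_prompts)
        # Gate: bias routing toward prompts whose hard template is similar to the instance.
        sim = F.normalize(image_feat, dim=-1) @ F.normalize(self.hard_embeds, dim=-1).t()
        weights = F.softmax(logits + sim, dim=-1)                    # (B, n_prompts)
        # Instance-specific prompt = weighted mixture of the soft prompts.
        mixed = torch.einsum("bn,nld->bld", weights, self.soft_prompts)
        return mixed, weights

feats = torch.randn(8, 512)
prompts, w = MixtureOfSoftPrompts()(feats)
print(prompts.shape, w.shape)   # torch.Size([8, 16, 512]) torch.Size([8, 4])
```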
https://arxiv.org/abs/2409.12011
Large language models (LLMs) have enabled a range of applications in zero-shot and few-shot learning settings, including the generation of synthetic datasets for training and testing. However, to reliably use these synthetic datasets, it is essential to understand how representative they are of real-world data. We investigate this by assessing the effectiveness of generating synthetic data through LLMs and using it as a benchmark for various NLP tasks. Our experiments across six datasets and three different tasks show that while synthetic data can effectively capture the performance of various methods for simpler tasks, such as intent classification, it falls short for more complex tasks like named entity recognition. Additionally, we propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used both to generate benchmarking data and to perform the tasks. We find that smaller LLMs exhibit biases towards their own generated data, whereas larger models do not. Overall, our findings suggest that the effectiveness of synthetic data as a benchmark varies depending on the task, and that practitioners should rely on data generated from multiple larger models whenever possible.
https://arxiv.org/abs/2409.11968
Few-shot class-incremental learning (FSCIL) aims to incrementally recognize new classes using a few samples while maintaining the performance on previously learned classes. One of the effective methods to solve this challenge is to construct prototypical evolution classifiers. Despite the advancement achieved by most existing methods, the classifier weights are simply initialized using mean features. Because representations for new classes are weak and biased, we argue such a strategy is suboptimal. In this paper, we tackle this issue from two aspects. Firstly, thanks to the development of foundation models, we employ a foundation model, CLIP, as the network pedestal to provide a general representation for each class. Secondly, to generate a more reliable and comprehensive instance representation, we propose a Knowledge Adapter (KA) module that summarizes the data-specific knowledge from training data and fuses it into the general representation. Additionally, to tune the knowledge learned from the base classes to the upcoming classes, we propose a mechanism of Incremental Pseudo Episode Learning (IPEL) by simulating the actual FSCIL. Taken together, our proposed method, dubbed Knowledge Adaptation Network (KANet), achieves competitive performance on a wide range of datasets, including CIFAR100, CUB200, and ImageNet-R.
https://arxiv.org/abs/2409.11770
3D Gaussian Splatting has emerged as a powerful 3D scene representation technique, capturing fine details with high efficiency. In this paper, we introduce a novel voting-based method that extends 2D segmentation models to 3D Gaussian splats. Our approach leverages masked gradients, where gradients are filtered by input 2D masks, and these gradients are used as votes to achieve accurate segmentation. As a byproduct, we discovered that inference-time gradients can also be used to prune Gaussians, resulting in up to 21% compression. Additionally, we explore few-shot affordance transfer, allowing annotations from 2D images to be effectively transferred onto 3D Gaussian splats. The robust yet straightforward mathematical formulation underlying this approach makes it a highly effective tool for numerous downstream applications, such as augmented reality (AR), object editing, and robotics. The project code and additional resources are available at this https URL.
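A NumPy sketch of the voting idea, with random arrays standing in for the rasterizer's per-Gaussian gradients and the 2D masks (the in-mask versus out-of-mask voting rule is an assumption made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_gaussians, n_views, h, w = 1000, 4, 32, 32

votes_in = np.zeros(n_gaussians)   # masked-gradient votes (pixels inside the 2D mask)
votes_out = np.zeros(n_gaussians)  # gradient mass falling outside the mask

for _ in range(n_views):
    # Stand-in for |d(pixel)/d(Gaussian)| magnitudes from a backward pass through the rasterizer.
    grad = rng.random((n_gaussians, h, w))
    mask = rng.random((h, w)) > 0.5          # 2D mask from an off-the-shelf segmentation model
    votes_in += (grad * mask).sum(axis=(1, 2))
    votes_out += (grad * ~mask).sum(axis=(1, 2))

# A Gaussian is assigned to the object if most of its gradient mass falls inside the masks.
selected = votes_in > votes_out
print(selected.sum(), "of", n_gaussians, "Gaussians assigned to the object")
```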
https://arxiv.org/abs/2409.11681
Few-shot class-incremental learning is crucial for developing scalable and adaptive intelligent systems, as it enables models to acquire new classes with minimal annotated data while safeguarding the previously accumulated knowledge. Nonetheless, existing methods deal with continuous data streams in a centralized manner, limiting their applicability in scenarios that prioritize data privacy and security. To this end, this paper introduces federated few-shot class-incremental learning, a decentralized machine learning paradigm tailored to progressively learn new classes from scarce data distributed across multiple clients. In this learning paradigm, clients locally update their models with new classes while preserving data privacy, and then transmit the model updates to a central server where they are aggregated globally. However, this paradigm faces several issues, such as difficulties in few-shot learning, catastrophic forgetting, and data heterogeneity. To address these challenges, we present a synthetic data-driven framework that leverages replay buffer data to maintain existing knowledge and facilitate the acquisition of new knowledge. Within this framework, a noise-aware generative replay module is developed to fine-tune local models with a balance of new and replay data, while generating synthetic data of new classes to further expand the replay buffer for future tasks. Furthermore, a class-specific weighted aggregation strategy is designed to tackle data heterogeneity by adaptively aggregating class-specific parameters based on local models' performance on synthetic data. This enables effective global model optimization without direct access to client data. Comprehensive experiments across three widely-used datasets underscore the effectiveness and preeminence of the introduced framework.
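A NumPy sketch of the class-specific weighted aggregation step, with toy client classifiers and per-class accuracies on synthetic data (the paper's exact weighting scheme may differ from this normalized-accuracy form):

```python
import numpy as np

# Sketch: aggregate class-specific classifier weights across clients, weighting each client
# by its accuracy on synthetic data for that class (all values below are toy stand-ins).
rng = np.random.default_rng(1)
n_clients, n_classes, dim = 3, 5, 8

client_classifiers = rng.normal(size=(n_clients, n_classes, dim))       # per-client class weights
acc_on_synthetic = rng.uniform(0.2, 1.0, size=(n_clients, n_classes))   # per-class accuracy on synthetic data

# Normalize per class so the aggregation weights across clients sum to 1 for every class.
agg_weights = acc_on_synthetic / acc_on_synthetic.sum(axis=0, keepdims=True)   # (n_clients, n_classes)
global_classifier = np.einsum("nc,ncd->cd", agg_weights, client_classifiers)   # (n_classes, dim)
print(global_classifier.shape)   # the server builds the global classifier without seeing client data
```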
https://arxiv.org/abs/2409.11657
Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis, primarily affecting the lungs. Early detection is crucial for improving treatment effectiveness and reducing transmission risk. Artificial intelligence (AI), particularly through image classification of chest X-rays, can assist in TB detection. However, class imbalance in TB chest X-ray datasets presents a challenge for accurate classification. In this paper, we propose a few-shot learning (FSL) approach using the Prototypical Network algorithm to address this issue. We compare the performance of ResNet-18, ResNet-50, and VGG16 in feature extraction from the TBX11K Chest X-ray dataset. Experimental results demonstrate classification accuracies of 98.93% for ResNet-18, 98.60% for ResNet-50, and 33.33% for VGG16. These findings indicate that the proposed method outperforms others in mitigating data imbalance, which is particularly beneficial for disease classification applications.
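For reference, the core of a Prototypical Network episode as used here can be sketched in a few lines of PyTorch (the feature dimension and episode sizes below are toy values; features would come from a backbone such as ResNet-18):

```python
import torch
import torch.nn.functional as F

def prototypical_predict(support_feats, support_labels, query_feats, n_way):
    """Classify queries by distance to class prototypes (the mean support embedding per class)."""
    prototypes = torch.stack([support_feats[support_labels == c].mean(0) for c in range(n_way)])
    dists = torch.cdist(query_feats, prototypes)        # (n_query, n_way) Euclidean distances
    return F.softmax(-dists, dim=-1).argmax(dim=-1)     # nearest prototype wins

# Toy 3-way, 5-shot episode with 64-d features.
support = torch.randn(15, 64)
labels = torch.arange(3).repeat_interleave(5)
query = torch.randn(9, 64)
print(prototypical_predict(support, labels, query, n_way=3))
```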
https://arxiv.org/abs/2409.11644
Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit high overlap between the cosine-similarity distributions of paired and unpaired examples in the image space, which affects the performance of few-shot training-free classification methods that rely on similarity in the image space for their predictions. To tackle intra-modal overlap we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset, demonstrating that this improves accuracy for few-shot training-free classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift and c) higher feature variance rendering the features more discriminative for downstream tasks.
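The intra-modal overlap being analysed can be illustrated with a small PyTorch sketch that compares the cosine-similarity distributions of same-class (paired) and different-class (unpaired) image embeddings (random features stand in for CLIP outputs):

```python
import torch
import torch.nn.functional as F

def intra_modal_overlap(feats, labels):
    """Compare cosine-similarity distributions of same-class vs. different-class image embeddings."""
    feats = F.normalize(feats, dim=-1)
    sims = feats @ feats.T                               # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    paired = sims[same & off_diag]                       # same class, excluding self-similarity
    unpaired = sims[~same]                               # different classes
    return paired.mean().item(), unpaired.mean().item()

# Toy CLIP-like image features (512-d); a small gap between the two means indicates high overlap.
feats, labels = torch.randn(100, 512), torch.randint(0, 10, (100,))
print(intra_modal_overlap(feats, labels))
```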
https://arxiv.org/abs/2409.11338
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve state-of-the-art results on benchmark datasets such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. this https URL
https://arxiv.org/abs/2409.11316
Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. Firstly, natural language instructions often lack explicit task planning. Secondly, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Models (LLMs) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address the above limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground-truth. Compared to the conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach to progressively update the database. In each iteration, P-RAG retrieves the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we also introduce a more granular retrieval scheme that not only retrieves similar tasks but also incorporates retrieval of similar situations to provide more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can even further improve performance through self-iterations.
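A schematic Python sketch of the progressive retrieval loop; the retriever, planner, and environment below are trivial stand-ins, and the real P-RAG additionally retrieves by situation similarity rather than task keywords alone:

```python
# Sketch of the progressive retrieval loop: retrieve -> plan -> act -> store, repeated so the
# database (and thus the retrieved experience) grows across iterations instead of being fixed.
database = []   # grows across episodes with (task, trajectory, outcome) records

def retrieve(db, task, top_k=3):
    # Stand-in retriever: most recent records whose task shares a word with the query task.
    hits = [r for r in db if set(r["task"].split()) & set(task.split())]
    return hits[-top_k:]

def llm_plan(task, references):
    return f"plan for '{task}' using {len(references)} retrieved experiences"   # placeholder for an LLM call

def run_in_env(plan):
    return [plan], "success"   # placeholder for interaction with the embodied environment

def p_rag_episode(task, n_iterations=3):
    outcome = None
    for _ in range(n_iterations):
        refs = retrieve(database, task)          # always query the *latest* database
        plan = llm_plan(task, refs)
        trajectory, outcome = run_in_env(plan)
        database.append({"task": task, "trajectory": trajectory, "outcome": outcome})
    return outcome

print(p_rag_episode("put a clean mug on the coffee table"))
```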
https://arxiv.org/abs/2409.11279
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items, rather than one prototype per label. Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music, and is evaluated against existing approaches in the literature. The results demonstrate a significant performance improvement in almost all domains and training setups when using LC-Protonets for multi-label classification. In addition to training a few-shot learning model from scratch, we explore the use of a pre-trained model, obtained via supervised learning, to embed items in the feature space. Fine-tuning improves the generalization ability of all methods, yet LC-Protonets achieve high-level performance even without fine-tuning, in contrast to the comparative approaches. We finally analyze the scalability of the proposed method, providing detailed quantitative metrics from our experiments. The implementation and experimental setup are made publicly available, offering a benchmark for future research.
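A NumPy sketch of prototype construction per label combination (queries would then be assigned the label set of their nearest prototype; the embeddings and tag sets below are toy values):

```python
from itertools import chain, combinations
import numpy as np

def label_combination_prototypes(feats, label_sets):
    """One prototype per label combination present in the support items (power-set style)."""
    prototypes = {}
    for feat, labels in zip(feats, label_sets):
        # Every non-empty subset of an item's label set is a combination that item supports.
        subsets = chain.from_iterable(combinations(sorted(labels), r) for r in range(1, len(labels) + 1))
        for combo in subsets:
            prototypes.setdefault(combo, []).append(feat)
    return {combo: np.mean(v, axis=0) for combo, v in prototypes.items()}

# Toy multi-label support set: 4 items, 8-d embeddings, tag sets drawn from {rock, jazz, vocal}.
feats = np.random.randn(4, 8)
label_sets = [{"rock"}, {"rock", "vocal"}, {"jazz", "vocal"}, {"jazz"}]
protos = label_combination_prototypes(feats, label_sets)
print(sorted(protos))   # ('jazz',), ('jazz','vocal'), ('rock',), ('rock','vocal'), ('vocal',)
```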
https://arxiv.org/abs/2409.11264
Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few-shot semantic segmentation is one approach that can encourage deep learning models to learn from few labelled examples for novel classes not seen during the training. The generalized few-shot segmentation setting has an additional challenge which encourages models not only to adapt to the novel classes but also to maintain strong performance on the training base classes. While previous datasets and benchmarks discussed the few-shot segmentation setting in remote sensing, we are the first to propose a generalized few-shot segmentation benchmark for remote sensing. The generalized setting is more realistic and challenging, which necessitates exploring it within the remote sensing context. We release the dataset augmenting OpenEarthMap with additional classes labelled for the generalized few-shot evaluation setting. The dataset is released during the OpenEarthMap land cover mapping generalized few-shot challenge in the L3D-IVU workshop in conjunction with CVPR 2024. In this work, we summarize the dataset and challenge details in addition to providing the benchmark results on the two phases of the challenge for the validation and test sets.
https://arxiv.org/abs/2409.11227
Large language models (LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning (ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM's performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate exemplars. On the other hand, the logical connections between reasoning steps can also be beneficial for depicting the problem-solving process. In this paper, we propose a novel method named Reasoning Graph-enhanced Exemplar Retrieval (RGER). RGER first queries the LLM to generate an initial response, then expresses the intermediate problem-solving steps as a graph structure. After that, it employs a graph kernel to select exemplars with semantic and structural similarity. Extensive experiments demonstrate that the structural relationship is helpful for aligning queries with candidate exemplars. The efficacy of RGER on math and logical reasoning tasks showcases its superiority over state-of-the-art retrieval-based approaches. Our code is released at this https URL.
https://arxiv.org/abs/2409.11147
In the realm of task-oriented dialogue systems, a robust intent detection mechanism must effectively handle malformed utterances encountered in real-world scenarios. This study presents a novel fine-tuning framework for large language models (LLMs) aimed at enhancing in-distribution (ID) intent classification and out-of-distribution (OOD) intent detection, which utilizes semantic matching with prototypes derived from ID class names. By harnessing the highly distinguishable representations of LLMs, we construct semantic prototypes for each ID class using a diversity-grounded prompt tuning approach. We rigorously test our framework in a challenging OOD context, where ID and OOD classes are semantically close yet distinct, referred to as \emph{near} OOD detection. For a thorough assessment, we benchmark our method against the prevalent fine-tuning approaches. The experimental findings reveal that our method demonstrates superior performance in both few-shot ID intent classification and near-OOD intent detection tasks.
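A PyTorch sketch of prototype-based near-OOD scoring (the cosine-similarity threshold rule is an illustrative assumption; the paper's exact decision rule and prototype construction may differ):

```python
import torch
import torch.nn.functional as F

def ood_score(query_feats, prototypes, threshold=0.5):
    """Flag queries whose best similarity to any ID-class prototype falls below a threshold."""
    q = F.normalize(query_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)                  # one semantic prototype per ID class
    sims = q @ p.T                                       # (n_query, n_id_classes)
    max_sim, pred_class = sims.max(dim=-1)
    is_ood = max_sim < threshold                         # low similarity to every ID prototype -> OOD
    return pred_class, is_ood

# Toy setup: 5 ID-class prototypes and 10 query utterance embeddings (768-d stand-ins).
protos, queries = torch.randn(5, 768), torch.randn(10, 768)
print(ood_score(queries, protos))
```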
https://arxiv.org/abs/2409.11114
Learned image compression (LIC) has achieved state-of-the-art rate-distortion performance, deemed promising for next-generation image compression techniques. However, pre-trained LIC models usually suffer from significant performance degradation when applied to out-of-training-domain images, implying their poor generalization capabilities. To tackle this problem, we propose a few-shot domain adaptation method for LIC by integrating plug-and-play adapters into pre-trained models. Drawing inspiration from the analogy between latent channels and frequency components, we examine domain gaps in LIC and observe that out-of-training-domain images disrupt pre-trained channel-wise decomposition. Consequently, we introduce a method for channel-wise re-allocation using convolution-based adapters and low-rank adapters, which are lightweight and compatible with mainstream LIC schemes. Extensive experiments across multiple domains and multiple representative LIC schemes demonstrate that our method significantly enhances pre-trained models, achieving comparable performance to H.266/VVC intra coding with merely 25 target-domain samples. Additionally, our method matches the performance of full-model finetune while transmitting fewer than $2\%$ of the parameters.
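A PyTorch sketch of a lightweight low-rank adapter acting on the latent channels of a frozen LIC transform (the channel count, rank, and residual 1x1 form are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class LowRankChannelAdapter(nn.Module):
    """Sketch: a low-rank 1x1 adapter that re-allocates latent channels of a frozen LIC transform."""
    def __init__(self, channels=192, rank=8):
        super().__init__()
        self.down = nn.Conv2d(channels, rank, kernel_size=1, bias=False)   # compress channel mix
        self.up = nn.Conv2d(rank, channels, kernel_size=1, bias=False)     # re-allocate channels
        nn.init.zeros_(self.up.weight)   # adapter starts as a no-op (residual contribution is zero)

    def forward(self, latent):
        return latent + self.up(self.down(latent))   # residual update of the channel mixing

latent = torch.randn(1, 192, 16, 16)                 # latent from a frozen pre-trained analysis transform
adapted = LowRankChannelAdapter()(latent)
print(adapted.shape)   # only the small adapter weights would need to be transmitted per domain
```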
https://arxiv.org/abs/2409.11111
Large Language Models (LLMs) have supplanted traditional methods in numerous natural language processing tasks. Nonetheless, in Named Entity Recognition (NER), existing LLM-based methods underperform compared to baselines and require significantly more computational resources, limiting their application. In this paper, we introduce the task of generation-based extraction and in-context classification (GEIC), designed to leverage LLMs' prior knowledge and self-attention mechanisms for NER tasks. We then propose CascadeNER, a universal and multilingual GEIC framework for few-shot and zero-shot NER. CascadeNER employs model cascading to utilize two small-parameter LLMs to extract and classify independently, reducing resource consumption while enhancing accuracy. We also introduce AnythingNER, the first NER dataset specifically designed for LLMs, including 8 languages, 155 entity types and a novel dynamic categorization system. Experiments show that CascadeNER achieves state-of-the-art performance on low-resource and fine-grained scenarios, including CrossNER and FewNERD. Our work is openly accessible.
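A schematic Python sketch of the extract-then-classify cascade; the two helper functions are canned stand-ins for the two small-parameter LLMs, and the prompts and labels are illustrative:

```python
def extract_entities(sentence):
    # Stand-in for the first (generation-based extraction) LLM, which would list entity mentions.
    return ["Marie Curie", "Warsaw"]

def classify_entity(span, sentence):
    # Stand-in for the second (in-context classification) LLM, which labels each extracted span.
    return {"Marie Curie": "Person", "Warsaw": "Location"}.get(span, "Other")

def cascade_ner(sentence):
    # Stage 1: extract candidate spans; Stage 2: classify each span in the context of the sentence.
    return [(span, classify_entity(span, sentence)) for span in extract_entities(sentence)]

print(cascade_ner("Marie Curie was born in Warsaw."))
```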
https://arxiv.org/abs/2409.11022
For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) would directly exploit pre-trained encoders and only fine-tune the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on specific objects in the line of sight. This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called ``Prompt and Transfer" (PAT), which constructs a dynamic class-aware prompting paradigm to tune the encoder for focusing on the interested object (target class) in the current task. Three key points are elaborated to enhance the prompting: 1) Cross-modal linguistic information is introduced to initialize prompts for each task. 2) Semantic Prompt Transfer (SPT) that precisely transfers the class-specific semantics within the images to prompts. 3) Part Mask Generator (PMG) that works in conjunction with SPT to adaptively generate different but complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on 4 different tasks including standard FSS, Cross-domain FSS (e.g., CV, medical, and remote sensing domains), Weak-label FSS, and Zero-shot Segmentation, setting new state-of-the-art results on 11 benchmarks.
https://arxiv.org/abs/2409.10389
We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained with frequency-filtered images, the resulting model needs relatively more data to adapt to naturally looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.
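A NumPy sketch of image-adaptive frequency masking (the magnitude-quantile selection rule below is an illustrative stand-in for the paper's compression-inspired criterion, and whether the strongest or weakest components are dropped is a design choice):

```python
import numpy as np

def adaptive_frequency_mask(image, keep_ratio=0.3):
    """Keep only the strongest frequency components of this particular image, masking out the rest."""
    spectrum = np.fft.fft2(image)
    magnitude = np.abs(spectrum)
    # Image-adaptive selection: the threshold comes from this image's own frequency response,
    # rather than a pre-defined frequency band shared by all images.
    threshold = np.quantile(magnitude, 1.0 - keep_ratio)
    mask = magnitude >= threshold
    filtered = np.fft.ifft2(spectrum * mask).real
    return filtered, mask

image = np.random.rand(64, 64)            # stand-in for a grayscale training image
filtered, mask = adaptive_frequency_mask(image)
print(mask.mean())                        # fraction of frequencies kept (about keep_ratio)
```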
https://arxiv.org/abs/2409.10362
We propose a self-supervised model producing 3D anatomical positional embeddings (APE) of individual medical image voxels. APE encodes voxels' anatomical closeness, i.e., voxels of the same organ or nearby organs always have closer positional embeddings than the voxels of more distant body parts. In contrast to the existing models of anatomical positional embeddings, our method is able to efficiently produce a map of voxel-wise embeddings for a whole volumetric input image, which makes it an optimal choice for different downstream applications. We train our APE model on 8400 publicly available CT images of abdomen and chest regions. We demonstrate its superior performance compared with the existing models on anatomical landmark retrieval and weakly-supervised few-shot localization of 13 abdominal organs. As a practical application, we show how to cheaply train APE to crop raw CT images to different anatomical regions of interest with 0.99 recall, while reducing the image volume by 10-100 times. The code and the pre-trained APE model are available at this https URL .
https://arxiv.org/abs/2409.10291
Domain-specific Named Entity Recognition (NER), whose goal is to recognize domain-specific entities and their categories, provides important support for constructing domain knowledge graphs. Currently, deep learning-based methods are widely used and effective in NER tasks, but they rely on large-scale labeled data, so the scarcity of labeled data in a specific domain limits their application. Many studies have therefore introduced few-shot methods and achieved some results. However, entity structures in specific domains are often complex, and current few-shot methods struggle to adapt to NER tasks with such complex features. Taking the Chinese coal chemical industry domain as an example, there exist complex structures in which multiple entities share a single entity and the same pair of entities holds multiple relationships, which hampers the NER task under the low-sample condition. In this paper, we propose LLM-DER, a Large Language Model (LLM)-based entity recognition framework for domain-specific entity recognition in Chinese. It enriches entity information by using LLMs to generate a list of relationships containing entity types, and designs a plausibility and consistency evaluation method to remove misrecognized entities, effectively solving the complex structural entity recognition problem in a specific domain. Experimental results on the Resume dataset and the self-constructed coal chemical dataset Coal show that LLM-DER performs outstandingly in domain-specific entity recognition, not only outperforming the existing GPT-3.5-turbo baseline but also exceeding the fully-supervised baseline, verifying its effectiveness in entity recognition.
https://arxiv.org/abs/2409.10077