Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmarks with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 in novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at this https URL.
Task-oriented dialogue (TOD) systems facilitate users in executing various activities via multi-turn dialogues, but Large Language Models (LLMs) often struggle to comprehend these intricate contexts. In this study, we propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of LLMs in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks. Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts, demonstrating its potential as a powerful tool in enhancing LLMs' comprehension in complex dialogue tasks.
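The prompting idea described above can be illustrated with a minimal sketch. The abstract does not give the exact prompt wording, so the template text, the function name `build_self_explanation_prompt`, and the example dialogue are all assumptions made for illustration. The only property carried over from the abstract is that the model is asked to analyze every utterance before it performs the task.

```python
def build_self_explanation_prompt(dialogue, task_instruction):
    """Assemble a zero-shot prompt that asks for per-utterance explanations first.

    dialogue: list of (speaker, utterance) pairs.
    task_instruction: the downstream task, stated in plain language.
    The wording of the two steps is a hypothetical template, not the paper's.
    """
    lines = []
    for i, (speaker, utterance) in enumerate(dialogue, start=1):
        lines.append(f"[{i}] {speaker}: {utterance}")
    return (
        "Dialogue:\n" + "\n".join(lines) + "\n\n"
        "Step 1: Explain each utterance above in your own words, one line per turn.\n"
        f"Step 2: {task_instruction}\n"
    )


# Hypothetical two-turn dialogue and task.
dialogue = [
    ("User", "I need a cheap hotel in the north."),
    ("System", "How many nights?"),
]
prompt = build_self_explanation_prompt(
    dialogue, "Output the user's dialogue state as slot-value pairs."
)
print(prompt)
```

Because the template is task-agnostic, only the `task_instruction` string changes between tasks. The explanation step stays the same.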
Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shifts by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOS-NET. During training, our model learns a shared and discriminative embedding space while creating a pseudo open-space decision boundary, given a fully-supervised source domain and a label-disjoint few-shot target domain. To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains' closed and pseudo-open spaces. Furthermore, we propose a domain-specific batch-normalized class prototypes alignment strategy to align both domains globally while ensuring class-discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-NET can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-NET through extensive experimentation.
Answering numerical questions over hybrid contents from the given tables and text (TextTableQA) is a challenging task. Recently, Large Language Models (LLMs) have gained significant attention in the NLP community. With the emergence of large language models, In-Context Learning and Chain-of-Thought prompting have become two particularly popular research topics in this field. In this paper, we introduce a new prompting strategy called Hybrid prompt strategy and Retrieval of Thought for TextTableQA. Through In-Context Learning, we prompt the model to develop the ability of retrieval thinking when dealing with hybrid data. Our method achieves superior performance compared to the fully-supervised SOTA on the MultiHiertt dataset in the few-shot setting.
Instruction-tuned Large Language Models (It-LLMs) have exhibited outstanding abilities to reason about the cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively. Several multiple-choice question (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities. However, earlier works have demonstrated the presence of an inherent "order bias" in It-LLMs, posing challenges to appropriate evaluation. In this paper, we investigate It-LLMs' resilience to a series of probing tests using four MCQ benchmarks. Introducing adversarial examples, we show a significant performance gap, mainly when varying the order of the choices, which reveals a selection bias and calls reasoning abilities into question. Having observed a correlation between first positions and model choices due to positional bias, we hypothesize the presence of structural heuristics in the decision-making process of the It-LLMs, which is strengthened by including significant examples in few-shot scenarios. Finally, by using the Chain-of-Thought (CoT) technique, we prompt the models to reason, which mitigates the bias and yields more robust models.
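One way to probe the order sensitivity described above is to present the same question under every ordering of its choices and check how often the selected option text stays the same. The sketch below is an illustration only: the function `order_consistency`, the two toy "models", and the example question are hypothetical and are not the paper's evaluation protocol.

```python
import itertools


def order_consistency(question, choices, answer_fn):
    """Share of choice orderings on which answer_fn selects the same option text.

    answer_fn(question, ordered_choices) returns the index of the chosen option.
    A value of 1.0 means the choice is invariant to ordering.
    """
    picks = []
    for perm in itertools.permutations(choices):
        idx = answer_fn(question, list(perm))
        picks.append(perm[idx])  # record the option text, not its position
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / len(picks)


def first_position_model(question, choices):
    """Toy stand-in for a positionally biased model: always picks option 0."""
    return 0


def content_model(question, choices):
    """Toy stand-in for an order-invariant model: picks by content."""
    return choices.index("Paris")


choices = ["Paris", "Rome", "Oslo"]
biased = order_consistency("Capital of France?", choices, first_position_model)
robust = order_consistency("Capital of France?", choices, content_model)
print(biased, robust)
```

With three choices there are six orderings. The position-only model selects each option text twice, so its consistency is one third, while the content-based model scores 1.0.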
Large language models (LLMs) have demonstrated dominant performance in many NLP tasks, especially generative tasks. However, they often fall short in some information extraction tasks, particularly those requiring domain-specific knowledge, such as Biomedical Named Entity Recognition (NER). In this paper, inspired by Chain-of-Thought, we leverage the LLM to solve Biomedical NER step by step: we break down the NER task into entity span extraction and entity type determination. Additionally, for entity type determination, we inject entity knowledge to address the LLM's lack of domain knowledge when predicting entity categories. Experimental results show a significant improvement of our two-step BioNER approach over the previous few-shot LLM baseline. Additionally, the incorporation of external knowledge significantly enhances entity category determination performance.
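The two-step decomposition can be sketched as a small pipeline: one call extracts spans, and a second call per span determines the type, with a knowledge snippet injected into the prompt. Everything concrete here is an assumption for illustration: the prompt wording, the `;` separator, the label set, the `knowledge` dictionary, and `toy_llm`, which is a rule-based stand-in so that the example runs without a real model.

```python
def two_step_ner(sentence, llm, knowledge):
    """Step 1: extract entity spans. Step 2: type each span with injected knowledge.

    llm: callable mapping a prompt string to a completion string.
    knowledge: dict from lower-cased entity mention to a short description.
    """
    span_prompt = f"List the biomedical entity mentions in: {sentence}"
    spans = [s.strip() for s in llm(span_prompt).split(";") if s.strip()]

    results = []
    for span in spans:
        hint = knowledge.get(span.lower(), "no definition available")
        type_prompt = (
            f"Entity: {span}\nKnowledge: {hint}\nType (Disease or Chemical):"
        )
        results.append((span, llm(type_prompt).strip()))
    return results


def toy_llm(prompt):
    """Rule-based stand-in for an LLM, used only to make the sketch runnable."""
    if prompt.startswith("List"):
        return "aspirin; headache"
    return "Chemical" if "drug" in prompt else "Disease"


# Hypothetical external knowledge source.
knowledge = {
    "aspirin": "a drug used to reduce pain",
    "headache": "pain in the head region",
}
entities = two_step_ner("Aspirin relieves headache.", toy_llm, knowledge)
print(entities)
```

Keeping the steps separate means the knowledge lookup only has to run for spans the first step actually proposed.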
The lack of annotated medical images limits the performance of deep learning models, which usually need large-scale labelled datasets. Few-shot learning techniques can reduce data scarcity issues and enhance medical image analysis, especially with meta-learning. This systematic review gives a comprehensive overview of few-shot learning in medical imaging. We searched the literature systematically and selected 80 relevant articles published from 2018 to 2023. We clustered the articles based on medical outcome, such as tumour segmentation, disease classification, and image registration; the anatomical structure investigated (e.g., heart, lung); and the meta-learning method used. For each cluster, we examined the papers' distributions and the results provided by the state-of-the-art. In addition, we identified a generic pipeline shared among all the studies. The review shows that few-shot learning can overcome data scarcity in most outcomes and that meta-learning is a popular choice to perform few-shot learning because it can adapt to new tasks with few labelled samples. After meta-learning, supervised learning and semi-supervised learning stand out as the predominant techniques employed to tackle few-shot learning challenges in medical imaging, and they are also the best performing. Lastly, we observed that the primary application areas are the cardiac, pulmonary, and abdominal domains. This systematic review aims to inspire further research to improve medical image analysis and patient care.
Crafting an effective Automatic Speech Recognition (ASR) solution for dialects demands innovative approaches that not only address the data scarcity issue but also navigate the intricacies of linguistic diversity. In this paper, we address this ASR challenge, focusing on the Tunisian dialect. First, textual and audio data is collected and in some cases annotated. Second, we explore self-supervision, semi-supervision and few-shot code-switching approaches to push the state-of-the-art on different Tunisian test sets, covering different acoustic, linguistic and prosodic conditions. Finally, given the absence of conventional spelling, we produce a human evaluation of our transcripts to avoid the noise coming from spelling inadequacies in our testing references. Our models, which can transcribe audio samples in a linguistic mix involving Tunisian Arabic, English and French, and all the data used during training and testing are released for public use and further improvements.
Few-shot point cloud semantic segmentation aims to train a model to quickly adapt to new unseen classes with only a handful of support set samples. However, the noise-free assumption in the support set can be easily violated in many practical real-world settings. In this paper, we focus on improving the robustness of few-shot point cloud segmentation under the detrimental influence of noisy support sets during testing time. To this end, we first propose Component-level Clean Noise Separation (CCNS) representation learning to learn discriminative feature representations that separate the clean samples of the target classes from the noisy samples. Leveraging the well-separated clean and noisy support samples from our CCNS, we further propose a Multi-scale Degree-based Noise Suppression (MDNS) scheme to remove the noisy shots from the support set. We conduct extensive experiments on various noise settings on two benchmark datasets. Our results show that the combination of CCNS and MDNS significantly improves the performance. Our code is available at this https URL.
Existing fully-supervised point cloud segmentation methods suffer in dynamic testing environments with emerging new classes. Few-shot point cloud segmentation algorithms address this problem by learning to adapt to new classes at the sacrifice of segmentation accuracy for the base classes, which severely impedes their practicality. This largely motivates us to present the first attempt at a more practical paradigm of generalized few-shot point cloud segmentation, which requires the model to generalize to new categories with only a few support point clouds and simultaneously retain the capability to segment base classes. We propose geometric words to represent geometric components shared between the base and novel classes, and incorporate them into a novel geometric-aware semantic representation to facilitate better generalization to the new classes without forgetting the old ones. Moreover, we introduce geometric prototypes to guide the segmentation with geometric prior knowledge. Extensive experiments on S3DIS and ScanNet consistently illustrate the superior performance of our method over baseline methods. Our code is available at: this https URL.
Few-shot Medical Image Segmentation (FSMIS) is a more promising solution for medical image segmentation tasks where high-quality annotations are naturally scarce. However, current mainstream methods primarily focus on extracting holistic representations from support images with large intra-class variations in appearance and background, and encounter difficulties in adapting to query images. In this work, we present an approach to extract multiple representative sub-regions from a given support medical image, enabling fine-grained selection over the generated image regions. Specifically, the foreground of the support image is decomposed into distinct regions, which are subsequently used to derive region-level representations via a designed Regional Prototypical Learning (RPL) module. We then introduce a novel Prototypical Representation Debiasing (PRD) module based on a two-way elimination mechanism which suppresses the disturbance of regional representations by a self-support, Multi-direction Self-debiasing (MS) block, and a support-query, Interactive Debiasing (ID) block. Finally, an Assembled Prediction (AP) module is devised to balance and integrate predictions of multiple prototypical representations learned using stacked PRD modules. Results obtained through extensive experiments on three publicly accessible medical imaging datasets demonstrate consistent improvements over the leading FSMIS methods. The source code is available at this https URL.
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images. However, this task has seldom been explored. In this work, based on IPMT, a state-of-the-art few-shot image segmentation method that combines external support guidance information with adaptive query guidance cues, we propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data. We decompose the query video information into a clip prototype and a memory prototype for capturing local and long-term internal temporal guidance, respectively. Frame prototypes are further used for each frame independently to handle fine-grained adaptive guidance and enable bidirectional clip-frame prototype communication. To reduce the influence of noisy memory, we propose to leverage the structural similarity relation between different predicted regions and the support to select reliable memory frames. Furthermore, a new segmentation loss is also proposed to enhance the category discriminability of the learned prototypes. Experimental results demonstrate that our proposed video IPMT model significantly outperforms previous models on two benchmark datasets. Code is available at this https URL.
In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on the overall system performance and assess different training strategies. For evaluation, we construct a novel dataset with prompts and music clips. We consider both embedding-based and music-specific metrics for quantitative evaluation, as well as a user study for qualitative evaluation. Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody. The code, dataset, and example material of this study are open to the research community.
In-context learning (ICL) using large language models for tasks with many labels is challenging due to the limited context window, which makes it difficult to fit a sufficient number of examples in the prompt. In this paper, we use a pre-trained dense retrieval model to bypass this limitation, giving the model only a partial view of the full label space for each inference call. Testing with recent open-source LLMs (OPT, LLaMA), we set new state-of-the-art performance in few-shot settings for three common intent classification datasets, with no fine-tuning. We also surpass fine-tuned performance on fine-grained sentiment classification in certain cases. We analyze the performance across the number of in-context examples and different model scales, showing that larger models are necessary to effectively and consistently make use of larger context lengths for ICL. By running several ablations, we analyze the model's use of: a) the similarity of the in-context examples to the current input, b) the semantic content of the class names, and c) the correct correspondence between examples and labels. We demonstrate that all three are needed to varying degrees depending on the domain, contrary to certain recent works.
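The retrieval step can be sketched as follows: for each query, only the k most similar labelled examples are placed in the prompt, so the model sees a partial view of the label space. The abstract specifies a pre-trained dense retriever. To keep the sketch self-contained, a bag-of-words cosine similarity is swapped in here, which is a simplification and not the paper's retriever. The prompt layout, the function names, and the example pool are likewise hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Bag-of-words counts; a simple stand-in for a dense embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(query, pool, k):
    """Keep the k labelled examples most similar to the query as demonstrations."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    demos = "\n".join(f"Input: {text}\nLabel: {label}" for text, label in ranked[:k])
    return f"{demos}\nInput: {query}\nLabel:"


# Hypothetical intent-classification pool with three labels.
pool = [
    ("what is my balance", "balance"),
    ("freeze my card now", "freeze_card"),
    ("book a flight to rome", "book_flight"),
]
prompt = build_prompt("please freeze my card", pool, k=2)
print(prompt)
```

In this toy run the `book_flight` example shares no tokens with the query, so that label never appears in the prompt.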
In this work, we introduce the concept of complex text style transfer tasks and construct complex text datasets based on two widely applicable scenarios. Our dataset is the first large-scale dataset of its kind, with 700 rephrased sentences and 1,000 sentences from the game Genshin Impact. While large language models (LLMs) have shown promise in complex text style transfer, they have drawbacks such as data privacy concerns, network instability, and high deployment costs. To address these issues, we explore the effectiveness of small models (less than T5-3B) with implicit style pre-training through contrastive learning. We also propose a method for automated evaluation of text generation quality based on alignment with human evaluations using ChatGPT. Finally, we compare our approach with existing methods and show that our model achieves state-of-the-art performance among few-shot text style transfer models.
Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments. To foster future research, we make the code and trained models publicly available at this http URL.
Object detection is an essential and fundamental task in computer vision and satellite image processing. Existing deep learning methods have achieved impressive performance thanks to the availability of large-scale annotated datasets. Yet, in real-world applications the availability of labels is limited. In this context, few-shot object detection (FSOD) has emerged as a promising direction, which aims at enabling the model to detect novel objects with only a few of them annotated. However, many existing FSOD algorithms overlook a critical issue: when an input image contains multiple novel objects and only a subset of them are annotated, the unlabeled objects will be considered as background during training. This can cause confusion and severely impact the model's ability to recall novel objects. To address this issue, we propose a self-training-based FSOD (ST-FSOD) approach, which incorporates the self-training mechanism into the few-shot fine-tuning process. ST-FSOD aims to enable the discovery of novel objects that are not annotated, and take them into account during training. On the one hand, we devise a two-branch region proposal network (RPN) to separate the proposal extraction of base and novel objects. On the other hand, we incorporate the student-teacher mechanism into the RPN and the region of interest (RoI) head to include those highly confident yet unlabeled targets as pseudo labels. Experimental results demonstrate that our proposed method outperforms the state-of-the-art in various FSOD settings by a large margin. The code will be publicly available at this https URL.
Controllable text generation is a fundamental aspect of natural language generation, with numerous methods proposed for different constraint types. However, these approaches often require significant architectural or decoding modifications, making them challenging to apply to additional constraints or resolve different constraint combinations. To address this, our paper introduces Regular Expression Instruction (REI), which utilizes an instruction-based mechanism to fully exploit regular expressions' advantages to uniformly model diverse constraints. Specifically, our REI supports all popular fine-grained controllable generation constraints, i.e., lexical, positional, and length, as well as their complex combinations, via regular expression-style instructions. Our method only requires fine-tuning on medium-scale language models or few-shot, in-context learning on large language models, and requires no further adjustment when applied to various constraint combinations. Experiments demonstrate that our straightforward approach yields high success rates and adaptability to various constraints while maintaining competitiveness in automatic metrics and outperforming most previous baselines.
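To make the three constraint types concrete, the sketch below expresses a lexical, a positional, and a length constraint as ordinary Python regular expressions and checks a candidate sentence against them. The abstract does not spell out REI's actual instruction syntax, so these patterns use standard `re` syntax as an assumed stand-in, and the helper `satisfies` and the example sentences are hypothetical. The code checks generated text after the fact and does not perform constrained generation.

```python
import re


def satisfies(text, pattern):
    """True if the whole text matches the regular-expression constraint."""
    return re.fullmatch(pattern, text) is not None


# Lexical: the words "cat" and "sleeps" must both appear, in that order.
lexical = r".*\bcat\b.*\bsleeps\b.*"
# Positional: the third whitespace-separated token must be "cat".
positional = r"(?:\S+ ){2}cat\b.*"
# Length: exactly five tokens separated by single spaces.
length = r"\S+(?: \S+){4}"

candidate = "The black cat sleeps soundly"
too_short = "A cat sleeps"

# A combination of constraints holds only if every pattern matches.
all_ok = all(satisfies(candidate, p) for p in (lexical, positional, length))
print(all_ok, satisfies(too_short, lexical), satisfies(too_short, length))
```

Expressing each constraint as a pattern means that combining constraints needs no new machinery, only a conjunction of matches.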
Graph Neural Networks (GNNs) have become popular in Graph Representation Learning (GRL). One fundamental application is few-shot node classification. Most existing methods follow the meta learning paradigm, showing the ability of fast generalization to few-shot tasks. However, recent works indicate that graph contrastive learning combined with fine-tuning can significantly outperform meta learning methods. Despite the empirical success, there is limited understanding of the reasons behind it. In our study, we first identify two crucial advantages of contrastive learning compared to meta learning, including (1) the comprehensive utilization of graph nodes and (2) the power of graph augmentations. To integrate the strength of both contrastive learning and meta learning on the few-shot node classification tasks, we introduce a new paradigm: Contrastive Few-Shot Node Classification (COLA). Specifically, COLA employs graph augmentations to identify semantically similar nodes, which enables the construction of meta-tasks without the need for label information. Therefore, COLA can utilize all nodes to construct meta-tasks, further reducing the risk of overfitting. Through extensive experiments, we validate the essentiality of each component in our design and demonstrate that COLA achieves new state-of-the-art on all tasks.