The core problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings, which prevents overfitting to the small number of classes seen during training while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourage vision-text feature alignment and guarantee it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric, demonstrates competitive performance on mAP-rare, and transfers better to COCO and Objects365.
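The abstract does not spell out the exact form of the gated shortcuts, but a common way to guarantee identity behavior at the start of training is a residual branch scaled by a zero-initialized gate. A minimal PyTorch sketch under that assumption (module name, branch layers, and tanh gating are illustrative, not the paper's design):

```python
import torch
import torch.nn as nn

class GatedShortcutBlock(nn.Module):
    """Residual branch scaled by a zero-initialized, trainable gate.

    With gate = 0 the block is an identity map, so the pretrained
    vision-text alignment is untouched when detection training starts;
    the detector then learns how much of the new branch to mix in.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.gate = nn.Parameter(torch.zeros(1))  # identity at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.gate) * self.branch(x)

# usage: drop-in replacement for a plain FPN / detection-head conv block
block = GatedShortcutBlock(256)
feat = torch.randn(2, 256, 32, 32)
out = block(feat)  # equals feat at initialization, since tanh(0) = 0
```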
https://arxiv.org/abs/2303.13518
The objective of this study is to address the critical issue of de-identification of clinical reports, in order to allow access to data for research purposes while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system merging the results of a deep learning model with manual rules. Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
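As an illustration of the hybrid design, the sketch below merges spans predicted by a NER model with spans matched by high-precision rules; the phone-number regex and the rules-win-on-overlap policy are assumptions made for the example, not AP-HP's actual rule set:

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)

# illustrative rule: French-style phone numbers (one rule per entity type in practice)
PHONE = re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b")

def rule_spans(text: str) -> List[Span]:
    return [(m.start(), m.end(), "PHONE") for m in PHONE.finditer(text)]

def merge(model_spans: List[Span], rules: List[Span]) -> List[Span]:
    """Union of model and rule predictions; on overlap the rule wins,
    since rules encode high-precision patterns the model may miss."""
    out = list(rules)
    for s in model_spans:
        if not any(s[0] < r[1] and r[0] < s[1] for r in rules):
            out.append(s)
    return sorted(out)

text = "Rappeler le patient au 06 12 34 56 78 demain."
print(merge([(12, 19, "PERSON")], rule_spans(text)))
```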
https://arxiv.org/abs/2303.13451
To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also uses scalar knobs to control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open source our code, model and data for future research.
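A minimal sketch of the retrieval defense, assuming a generic sentence encoder (all-MiniLM-L6-v2 via sentence-transformers) and a cosine-similarity threshold of 0.75 in place of the paper's own similarity model and tuned threshold:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in encoder

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# database of sequences previously generated by the API provider
generations = [
    "The mitochondria is the powerhouse of the cell.",
    "Transformers process all tokens of a sequence in parallel.",
]
db = encoder.encode(generations, normalize_embeddings=True)  # (N, d), unit-norm rows

def is_ai_generated(candidate: str, threshold: float = 0.75) -> bool:
    """Flag the candidate if any stored generation is semantically close.

    On unit-norm vectors cosine similarity is a dot product, so a paraphrase
    that preserves meaning still lands near its source sequence.
    """
    q = encoder.encode([candidate], normalize_embeddings=True)[0]
    return float(np.max(db @ q)) >= threshold

print(is_ai_generated("Mitochondria act as the cell's power plants."))
```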
https://arxiv.org/abs/2303.13408
Automated diagnosis prediction from medical images is a valuable resource to support clinical decision-making. However, such systems usually need to be trained on large amounts of annotated data, which often is scarce in the medical domain. Zero-shot methods address this challenge by allowing a flexible adaptation to new settings with different clinical findings without relying on labeled data. Further, to integrate automated diagnosis into the clinical workflow, methods should be transparent and explainable, increasing medical professionals' trust and facilitating correctness verification. In this work, we introduce Xplainer, a novel framework for explainable zero-shot diagnosis in the clinical setting. Xplainer adapts the classification-by-description approach of contrastive vision-language models to the multi-label medical diagnosis task. Specifically, instead of directly predicting a diagnosis, we prompt the model to classify the existence of descriptive observations, which a radiologist would look for on an X-ray scan, and use the descriptor probabilities to estimate the likelihood of a diagnosis. Our model is explainable by design, as the final diagnosis prediction is directly based on the prediction of the underlying descriptors. We evaluate Xplainer on two chest X-ray datasets, CheXpert and ChestX-ray14, and demonstrate its effectiveness in improving the performance and explainability of zero-shot diagnosis. Our results suggest that Xplainer provides a more detailed understanding of the decision-making process and can be a valuable tool for clinical diagnosis.
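The classification-by-description step can be sketched as follows. The descriptor lists and the mean-aggregation rule are illustrative assumptions, and descriptor_prob stands in for a vision-language model scoring a descriptor prompt against its negation:

```python
from typing import Callable, Dict, List

# hypothetical descriptor sets a radiologist might look for on an X-ray
DESCRIPTORS: Dict[str, List[str]] = {
    "pneumonia": ["airspace consolidation", "ill-defined opacity", "air bronchograms"],
    "pleural effusion": ["blunted costophrenic angle", "meniscus sign"],
}

def diagnose(image, descriptor_prob: Callable[[object, str], float]) -> Dict[str, float]:
    """Estimate each diagnosis as the mean probability of its descriptors.

    descriptor_prob(image, text) is assumed to return P(descriptor present),
    e.g. a softmax over contrastive scores of the positive vs. negated prompt.
    The per-descriptor probabilities double as the built-in explanation.
    """
    return {
        finding: sum(descriptor_prob(image, d) for d in descs) / len(descs)
        for finding, descs in DESCRIPTORS.items()
    }

# toy usage with a dummy scorer
print(diagnose(None, lambda img, d: 0.5))
```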
https://arxiv.org/abs/2303.13391
We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland -- German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at this https URL.
https://arxiv.org/abs/2303.13310
In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets on end-to-end KGQA over Wikidata.
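The grounding step might look like the sketch below: candidates are fetched by label from a (hypothetical) label-to-ID index, and the generated truncated KG embedding breaks ties among same-label entities. Index contents and dimensions are made up for illustration:

```python
import numpy as np
from typing import Dict, List, Tuple

# hypothetical index: surface form -> candidate (Wikidata ID, stored KG embedding)
label_index: Dict[str, List[Tuple[str, np.ndarray]]] = {
    "berlin": [("Q64", np.random.rand(200)), ("Q821244", np.random.rand(200))],
}

def ground(label: str, truncated_emb: np.ndarray, k: int = 10) -> str:
    """Map a generated entity label to a Wikidata ID.

    The T5 output carries both a label and a truncated KG embedding; among
    same-label candidates, pick the one whose stored embedding's leading
    dimensions best match the generated truncation (disambiguation).
    """
    candidates = label_index.get(label.lower(), [])[:k]
    if not candidates:
        return ""
    n = len(truncated_emb)
    best = min(candidates, key=lambda c: np.linalg.norm(c[1][:n] - truncated_emb))
    return best[0]

print(ground("Berlin", np.random.rand(10)))
```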
https://arxiv.org/abs/2303.13284
Prompt tuning is an effective way to adapt a pre-trained vision-language model (VLM) to downstream tasks using task-related textual tokens. Representative CoOp-based work combines the learnable textual tokens with the class tokens to obtain specific textual knowledge. However, this specific textual knowledge generalizes poorly to unseen classes because it forgets the essential general textual knowledge, which has a strong generalization ability. To tackle this issue, we introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt to unseen classes. The key insight of KgCoOp is that the forgetting of essential knowledge can be alleviated by reducing the discrepancy between the learnable prompt and the hand-crafted prompt. Specifically, KgCoOp minimizes the discrepancy between the textual embeddings generated by learned prompts and those of the hand-crafted prompts. Finally, adding KgCoOp on top of the contrastive loss yields a prompt that is discriminative for both seen and unseen tasks. Extensive evaluation on several benchmarks demonstrates that the proposed Knowledge-guided Context Optimization is an efficient method for prompt tuning, i.e., it achieves better performance with less training time.
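In loss form, the key idea reduces to the usual cross-entropy term plus a regularizer pulling the learned-prompt class embeddings toward the hand-crafted ones. A PyTorch sketch; the weight lam is illustrative, not the paper's tuned value:

```python
import torch
import torch.nn.functional as F

def kgcoop_loss(logits: torch.Tensor,            # (B, C) image-text similarity logits
                targets: torch.Tensor,           # (B,) class indices
                learned_text_emb: torch.Tensor,  # (C, d) from learnable prompts
                fixed_text_emb: torch.Tensor,    # (C, d) from "a photo of a {class}"
                lam: float = 8.0) -> torch.Tensor:
    """Cross-entropy plus a knowledge-guided regularizer.

    The second term penalizes the distance between class embeddings built
    from learned prompts and those from hand-crafted prompts, so the learned
    prompt keeps the general textual knowledge of the frozen VLM.
    """
    ce = F.cross_entropy(logits, targets)
    learned = F.normalize(learned_text_emb, dim=-1)
    fixed = F.normalize(fixed_text_emb, dim=-1)
    kg = (learned - fixed).pow(2).sum(dim=-1).mean()  # mean squared distance per class
    return ce + lam * kg
```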
https://arxiv.org/abs/2303.13283
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer from long-tail distribution issues: tail predicates are more costly to train and harder to distinguish due to the small amount of annotated data compared to frequent predicates. Existing re-balancing strategies try to handle this via prior rules but remain confined to pre-defined conditions, which are not scalable across models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.
https://arxiv.org/abs/2303.13233
Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in a vision-language model, i.e., CLIP, to compensate for insufficient annotations. In spite of promising performance, they generally overlook the valuable prior about the label-to-label correspondence. In this paper, we advocate remedying the deficiency of label supervision for MLR with incomplete labels by deriving a structured semantic prior about the label-to-label correspondence via a semantic prior prompter. We then present a novel Semantic Correspondence Prompt Network (SCPNet), which can thoroughly explore the structured semantic prior. A Prior-Enhanced Self-Supervised Learning method is further introduced to enhance the use of the prior. Comprehensive experiments and analyses on several widely used benchmark datasets show that our method significantly outperforms existing methods on all datasets, well demonstrating the effectiveness and superiority of our method. Our code will be available at this https URL.
https://arxiv.org/abs/2303.13223
Parameter-efficient transfer learning with adapters has been studied in Natural Language Processing (NLP) as an alternative to full fine-tuning. Adapters are memory-efficient and scale well with downstream tasks by training small bottleneck layers added between transformer layers while keeping the large pretrained language models (PLMs) frozen. In spite of showing promising results in NLP, these methods are under-explored in Information Retrieval. While previous studies have only experimented with dense retrievers or in a cross-lingual retrieval scenario, in this paper we aim to complete the picture on the use of adapters in IR. First, we study adapters for SPLADE, a sparse retriever, for which adapters not only retain the efficiency and effectiveness otherwise achieved by fine-tuning, but are memory-efficient and orders of magnitude lighter to train. We observe that Adapters-SPLADE not only optimizes just 2% of training parameters, but outperforms its fully fine-tuned counterpart and existing parameter-efficient dense IR models on IR benchmark datasets. Secondly, we address domain adaptation of neural retrieval using adapters on the cross-domain BEIR datasets and TripClick. Finally, we also consider knowledge sharing between rerankers and first-stage rankers. Overall, our study completes the examination of adapters for neural IR.
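For reference, a standard bottleneck adapter of the kind described above looks like this in PyTorch (hidden and bottleneck sizes are illustrative; the SPLADE-specific wiring is omitted):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a frozen transformer sub-layer.

    Only the two small projections are trained (a few percent of parameters
    in the setting above); the PLM weights stay frozen.
    """
    def __init__(self, hidden: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path intact

x = torch.randn(2, 128, 768)  # (batch, seq, hidden)
out = Adapter()(x)
```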
https://arxiv.org/abs/2303.13220
Large language models have demonstrated a surprising ability to perform in-context learning, i.e., these models can be directly applied to solve numerous downstream tasks by conditioning on a prompt constructed from a few input-output examples. However, prior research has shown that in-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats. Therefore, the construction of an appropriate prompt is essential for improving the performance of in-context learning. In this paper, we revisit this problem from the view of predictive bias. Specifically, we introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes. We then empirically show that prompts with higher bias always lead to unsatisfactory predictive quality. Based on this observation, we propose a novel greedy search strategy to identify the near-optimal prompt for improving the performance of in-context learning. We perform comprehensive experiments with state-of-the-art mainstream models such as GPT-3 on various downstream tasks. Our results indicate that our method can enhance the model's in-context learning performance in an effective and interpretable manner.
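A sketch of how such a pipeline could fit together, under the assumption that predictive bias is measured as the distance of the model's label distribution on a content-free input (e.g. "N/A") from uniform; the paper's exact metric and search details may differ:

```python
import numpy as np
from typing import Callable, List, Sequence

def predictive_bias(label_probs: np.ndarray) -> float:
    """Distance of the prompt's label distribution from uniform.

    label_probs: probabilities the model assigns to each label when the
    prompt is followed by a content-free query; a fair prompt spreads
    probability mass evenly before seeing any real input.
    """
    uniform = np.full_like(label_probs, 1.0 / len(label_probs))
    return float(np.abs(label_probs - uniform).sum())

def greedy_prompt(pool: Sequence[str],
                  probe: Callable[[List[str]], np.ndarray],
                  k: int = 4) -> List[str]:
    """Greedily pick k examples whose prompt has the lowest predictive bias.

    probe(examples) is assumed to return the model's label distribution for
    the prompt built from those examples plus a content-free input.
    """
    chosen: List[str] = []
    for _ in range(k):
        best = min((e for e in pool if e not in chosen),
                   key=lambda e: predictive_bias(probe(chosen + [e])))
        chosen.append(best)
    return chosen
```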
https://arxiv.org/abs/2303.13217
Various recent experimental results show that large language models (LLMs) exhibit emergent abilities that are not present in small models: system performance is greatly improved after passing a certain critical threshold of scale. In this letter, we provide a simple explanation for such a phase transition phenomenon. For this, we model an LLM as a sequence-to-sequence random function. Instead of using instant generation at each step, we use a list decoder that keeps a list of candidate sequences at each step and defers the generation of the output sequence to the end. We show that there is a critical threshold such that the expected number of erroneous candidate sequences remains bounded when an LLM is below the threshold, and grows exponentially when it is above the threshold. Such a threshold is related to the basic reproduction number of a contagious disease.
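In branching-process form, the flavor of the argument is the following (an illustrative rendering, not the paper's exact model). Let $X_n$ be the number of erroneous candidate sequences on the list at step $n$, and let $R_0$ be the mean number of erroneous successors each one spawns per step; then

```latex
\mathbb{E}[X_n] = \mathbb{E}[X_0]\, R_0^{\,n}
\;\Longrightarrow\;
\lim_{n\to\infty} \mathbb{E}[X_n] =
\begin{cases}
0      & \text{if } R_0 < 1 \quad \text{(below threshold: errors die out)}\\
\infty & \text{if } R_0 > 1 \quad \text{(above threshold: exponential blow-up)}
\end{cases}
```

The critical point $R_0 = 1$ plays the role of the scale threshold, exactly as the basic reproduction number separates dying and exploding epidemics.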
https://arxiv.org/abs/2303.13112
Pretrained language models (PLMs) have shown marvelous improvements across various NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters, and completely ignore word information. Although Whole Word Masking can alleviate this, the semantics in words is still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words. To achieve this, we design objective functions for learning both character- and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new SOTA performance on all these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works with Japanese. Our code has been released at this https URL and our model can be downloaded at this https URL.
https://arxiv.org/abs/2303.13065
Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs. To improve the effectiveness of these methods, researchers have utilized datasets with a large vocabulary that contains a large number of object classes, under the assumption that such data will enable models to extract comprehensive knowledge on the relationships between various objects and better generalize to unseen object classes. In this study, we argue that more fine-grained labels are necessary to extract richer knowledge about novel objects, including object attributes and relationships, in addition to their names. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which utilizes an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels offer dense samples for knowledge distillation. On the LVIS benchmark, our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance. PCL's simplicity and flexibility are other notable features, as it is a straightforward pre-processing technique that can be used with any image captioning model without imposing any restrictions on model architecture or training process.
https://arxiv.org/abs/2303.13040
Electronic health records (EHRs) store an extensive array of patient information, encompassing medical histories, diagnoses, treatments, and test outcomes. These records are crucial for enabling healthcare providers to make well-informed decisions regarding patient care. Summarizing clinical notes further assists healthcare professionals in pinpointing potential health risks and making better-informed decisions. This process contributes to reducing errors and enhancing patient outcomes by ensuring providers have access to the most pertinent and current patient data. Recent research has shown that incorporating prompts with large language models (LLMs) substantially boosts the efficacy of summarization tasks. However, we show that this approach also leads to increased output variance, resulting in notably divergent outputs even when prompts share similar meanings. To tackle this challenge, we introduce a model-agnostic Soft Prompt-Based Calibration (SPeC) pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization. Experimental findings on multiple clinical note tasks and LLMs indicate that our method not only bolsters performance but also effectively curbs variance for various LLMs, providing a more uniform and dependable solution for summarizing vital medical information.
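The abstract does not detail the soft-prompt parameterization, but a generic soft prompt, the ingredient SPeC calibrates with, is just a block of trainable vectors prepended to the token embeddings. A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend n_tokens trainable embedding vectors to the token embeddings.

    The LLM itself stays frozen; only these vectors are tuned, which is what
    lets a calibration scheme steer output variance without retraining.
    """
    def __init__(self, n_tokens: int = 20, dim: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:  # (B, T, d)
        b = token_emb.size(0)
        return torch.cat([self.prompt.expand(b, -1, -1), token_emb], dim=1)

emb = torch.randn(2, 128, 4096)   # embeddings of a clinical-note prompt
extended = SoftPrompt()(emb)      # (2, 148, 4096)
```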
https://arxiv.org/abs/2303.13035
Gesture synthesis has gained significant attention as a critical research area, focusing on producing contextually appropriate and natural gestures corresponding to speech or textual input. Although deep learning-based approaches have achieved remarkable progress, they often overlook the rich semantic information present in the text, leading to less expressive and meaningful gestures. We propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of Large Language Models (LLMs), such as GPT. By capitalizing on the strengths of LLMs for text analysis, we design prompts to extract gesture-related information from textual input. Our method entails developing prompt principles that transform gesture generation into an intention classification problem based on GPT, and utilizing a curated gesture library and integration module to produce semantically rich co-speech gestures. Experimental results demonstrate that GesGPT effectively generates contextually appropriate and expressive gestures, offering a new perspective on semantic co-speech gesture generation.
https://arxiv.org/abs/2303.13013
We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at this https URL.
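The coarse-to-fine token hierarchy comes from residual vector quantization. A toy sketch of that step (codebook count, size, and dimension made up), showing why early tokens are coarse and later ones refine the residual; it is these later, fine tokens that LMCodec predicts with its language model instead of transmitting:

```python
import torch
from typing import List

def residual_vq(x: torch.Tensor, codebooks: List[torch.Tensor]) -> List[torch.Tensor]:
    """Quantize frame embeddings into one token per stage via residual VQ.

    Each stage picks the nearest codeword to the current residual; stage 1
    captures coarse structure, later stages encode ever finer detail.
    """
    tokens, residual = [], x
    for cb in codebooks:  # cb: (K, d)
        idx = torch.cdist(residual.unsqueeze(0), cb.unsqueeze(0)).argmin(-1).squeeze(0)
        tokens.append(idx)
        residual = residual - cb[idx]  # quantize what previous stages missed
    return tokens

# toy usage: 3 quantizer stages, codebook size 16, embedding dim 8
cbs = [torch.randn(16, 8) for _ in range(3)]
frames = torch.randn(4, 8)        # 4 audio frames
codes = residual_vq(frames, cbs)  # 3 tensors of shape (4,)
```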
https://arxiv.org/abs/2303.12984
The successes of foundation models such as ChatGPT and AlphaFold have spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. We review over 80 foundation models trained on non-imaging EMR data (i.e. clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly-scoped clinical datasets (e.g. MIMIC-III) or broad, public biomedical corpora (e.g. PubMed) and are evaluated on tasks that do not provide meaningful insights on their usefulness to health systems. In light of these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded to metrics that matter in healthcare.
https://arxiv.org/abs/2303.12961
The potential of large language models (LLMs) to reason like humans has been a highly contested topic in Machine Learning communities. However, the reasoning abilities of humans are multifaceted and can be seen in various forms, including analogical, spatial and moral reasoning, among others. This fact raises the question whether LLMs can perform equally well across all these different domains. This research work aims to investigate the performance of LLMs on different reasoning tasks by conducting experiments that directly use or draw inspiration from existing datasets on analogical and spatial reasoning. Additionally, to evaluate the ability of LLMs to reason like humans, their performance is evaluated on more open-ended, natural language questions. My findings indicate that LLMs excel at analogical and moral reasoning, yet struggle to perform as proficiently on spatial reasoning tasks. I believe these experiments are crucial for informing the future development of LLMs, particularly in contexts that require diverse reasoning proficiencies. By shedding light on the reasoning abilities of LLMs, this study aims to push forward our understanding of how they can better emulate the cognitive abilities of humans.
https://arxiv.org/abs/2303.12810
Electronic medical records (EMRs) are stored in relational databases. It can be challenging to access the required information if the user is unfamiliar with the database schema or general database fundamentals. Hence, researchers have explored text-to-SQL generation methods that provide healthcare professionals direct access to EMR data without needing a database expert. However, currently available datasets have been essentially "solved", with state-of-the-art models achieving accuracy greater than or near 90%. In this paper, we show that there is still a long way to go before solving text-to-SQL generation in the medical domain. To show this, we create new splits of the existing medical text-to-SQL dataset MIMICSQL that better measure the generalizability of the resulting models. We evaluate state-of-the-art language models on our new split, showing substantial drops in performance, with accuracy dropping from up to 92% to 28%, thus showing substantial room for improvement. Moreover, we introduce a novel data augmentation approach to improve the generalizability of the language models. Overall, this paper is the first step towards developing more robust text-to-SQL models in the medical domain. The dataset and code will be released upon acceptance.
https://arxiv.org/abs/2303.12898