Due to recent advances in pose-estimation methods, human motion can be extracted from ordinary video in the form of 3D skeleton sequences. Despite the promising application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at this https URL.
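The abstract only names the mechanism, so here is a minimal PyTorch sketch of what divided space-time attention over skeleton joints can look like; the module layout, pre-norm residuals, and all dimensions are our own assumptions rather than MoT's exact design:

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """One divided space-time attention block over skeleton data.

    Input shape: (batch, frames, joints, dim). Temporal attention
    attends over frames for each joint; spatial attention attends
    over joints within each frame. Dimensions are illustrative.
    """

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, j, d = x.shape
        # Temporal attention: fold joints into the batch, attend over frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * j, t, d)
        h = self.norm_t(xt)
        xt = xt + self.time_attn(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, j, t, d).permute(0, 2, 1, 3)
        # Spatial attention: fold frames into the batch, attend over joints.
        xs = x.reshape(b * t, j, d)
        h = self.norm_s(xs)
        xs = xs + self.space_attn(h, h, h, need_weights=False)[0]
        return xs.reshape(b, t, j, d)

out = DividedSpaceTimeBlock()(torch.randn(2, 30, 21, 64))  # 30 frames, 21 joints
```

Factorizing attention this way costs O(T^2 + J^2) per token instead of O((TJ)^2) for joint attention over all frame-joint pairs.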
https://arxiv.org/abs/2305.15842
The term "Code Mixed" refers to the use of more than one language in the same text. This phenomenon is predominantly observed on social media platforms, with an increasing amount of adaptation as time goes on. It is critical to detect foreign elements in a language and process them correctly, as a considerable number of individuals are using code-mixed languages that could not be comprehended by understanding one of those languages. In this work, we focus on low-resource Hindi-English code-mixed language and enhancing the performance of different code-mixed natural language processing tasks such as sentiment analysis, emotion recognition, and hate speech identification. We perform a comparative analysis of different Transformer-based language Models pre-trained using unsupervised approaches. We have included the code-mixed models like HingBERT, HingRoBERTa, HingRoBERTa-Mixed, mBERT, and non-code-mixed models like AlBERT, BERT, and RoBERTa for comparative analysis of code-mixed Hindi-English downstream tasks. We report state-of-the-art results on respective datasets using HingBERT-based models which are specifically pre-trained on real code-mixed text. Our HingBERT-based models provide significant improvements thus highlighting the poor performance of vanilla BERT models on code-mixed text.
"代码混合"一词指的是在同一文本中使用多种语言的现象,这在社交媒体平台上尤为普遍,随着时间的流逝,适应度不断增加。重要的是要识别语言中的异国元素,并正确处理它们,因为相当多人使用代码混合语言,这些语言无法通过理解其中一种语言来理解。在本文中,我们重点关注资源有限的希伯来语-英语代码混合语言,并提高不同代码混合自然语言处理任务(如情感分析、情绪识别和恶言识别)的性能。我们使用无监督方法预先训练的不同Transformer-based语言模型进行了比较分析。我们包括代码混合模型,如HingBERT、HingRoBERTa、HingRoBERTa-混合、mBERT和非代码混合模型,如AlBERT、BERT和RoBERTa,以对代码混合希伯来语-英语下游任务进行代码混合语言比较分析。我们使用HingBERT-based模型分别报告了各自数据集的最佳结果,这些模型是在真实代码混合文本中进行预先训练的。我们的HingBERT-based模型提供了显著的改进,从而突出了代码混合文本中普通BERT模型表现不佳的情况。
https://arxiv.org/abs/2305.15722
Answer Sentence Selection (AS2) is a core component for building an accurate Question Answering pipeline. AS2 models rank a set of candidate sentences based on how likely they are to answer a given question. The state of the art in AS2 exploits pre-trained transformers by transferring them to large annotated datasets, while using local contextual information around the candidate sentence. In this paper, we propose three pre-training objectives designed to mimic the downstream fine-tuning task of contextual AS2. This allows for specializing LMs when fine-tuning for contextual AS2. Our experiments on three public and two large-scale industrial datasets show that our pre-training approaches (applied to RoBERTa and ELECTRA) can improve baseline contextual AS2 accuracy by up to 8% on some datasets.
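For readers new to AS2, a minimal sketch of the ranking step with an off-the-shelf cross-encoder follows; the checkpoint is untuned and the pairing scheme is our simplification (the paper's contextual variants additionally feed the sentences surrounding each candidate):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

def rank_candidates(question: str, candidates: list[str]):
    """Score each candidate sentence as an answer to the question with a
    cross-encoder and sort by the positive-class probability. The head
    here is randomly initialized; in practice it would be fine-tuned on
    an AS2 dataset first."""
    enc = tok([question] * len(candidates), candidates,
              padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**enc).logits.softmax(-1)[:, 1]
    return sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1])

ranked = rank_candidates("When was BERT released?",
                         ["BERT was released in 2018.", "The sky is blue."])
```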
https://arxiv.org/abs/2305.15358
Masked language modeling, widely used in discriminative language model (e.g., BERT) pretraining, commonly adopts a random masking strategy. However, random masking does not consider the importance of the different words to the sentence meaning, where some of them are more worth predicting than others. Therefore, various masking strategies (e.g., entity-level masking) have been proposed, but most of them require expensive prior knowledge and generally train from scratch without reusing existing model weights. In this paper, we present Self-Evolution learning (SE), a simple and effective token masking and learning method to fully and wisely exploit the knowledge in the data. SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach. Experiments on 10 tasks show that our SE brings consistent and significant improvements (+1.43~2.12 average scores) across different PLMs. In-depth analyses demonstrate that SE improves linguistic knowledge learning and generalization.
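The abstract does not spell out the Token-specific Label Smoothing formula, so the following is only one plausible reading: a cross-entropy loss whose smoothing strength varies per token, driven here by a hypothetical per-token score:

```python
import torch
import torch.nn.functional as F

def token_specific_label_smoothing(logits, targets, token_scores):
    """Cross-entropy with a per-token smoothing weight.

    logits: (num_tokens, vocab); targets: (num_tokens,);
    token_scores in [0, 1]: hypothetical per-token scores (e.g.,
    difficulty) scaling how much probability mass is smoothed away.
    The 0.1 cap on smoothing is our assumption.
    """
    vocab = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    eps = 0.1 * token_scores.unsqueeze(-1)            # per-token smoothing strength
    one_hot = F.one_hot(targets, vocab).float()
    soft_targets = (1 - eps) * one_hot + eps / vocab  # smooth toward uniform
    return -(soft_targets * log_probs).sum(-1).mean()

loss = token_specific_label_smoothing(
    torch.randn(8, 30522), torch.randint(0, 30522, (8,)), torch.rand(8))
```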
https://arxiv.org/abs/2305.15275
Token dropping is a recently proposed strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers. It can effectively reduce the training time without significantly degrading performance on downstream tasks. However, we empirically find that token dropping is prone to a semantic-loss problem and falls short in handling semantic-intensive tasks. Motivated by this, we propose a simple yet effective semantic-consistent learning method (ScTD) to improve token dropping. ScTD aims to encourage the model to learn how to preserve semantic information in the representation space. Extensive experiments on 12 tasks show that, with the help of our ScTD, token dropping can achieve consistent and significant performance gains across all task types and model sizes. More encouragingly, ScTD saves up to 57% of pretraining time and brings up to +1.56% average improvement over vanilla token dropping.
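As a rough illustration of the base technique being improved, the sketch below runs the middle layers only on the most important tokens and scatters the results back; the importance scores and keep ratio are placeholders, not the original paper's exact recipe:

```python
import torch

def token_drop_forward(hidden, importance, middle_layers, keep_ratio=0.5):
    """Sketch of token dropping: run the middle layers only on the
    `keep_ratio` most important tokens, then scatter the results back
    so later full layers see the complete sequence again.

    hidden: (batch, seq, dim); importance: (batch, seq) scores
    (accumulated MLM loss per position in the actual method);
    middle_layers: an iterable of layer modules. All illustrative.
    """
    b, s, d = hidden.shape
    k = max(1, int(s * keep_ratio))
    keep_idx = importance.topk(k, dim=1).indices                # (b, k)
    kept = hidden.gather(1, keep_idx.unsqueeze(-1).expand(b, k, d))
    for layer in middle_layers:
        kept = layer(kept)                                      # compute only on kept tokens
    out = hidden.clone()                                        # dropped tokens skip compute
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(b, k, d), kept)
    return out

out = token_drop_forward(torch.randn(2, 16, 64), torch.rand(2, 16),
                         [torch.nn.Linear(64, 64)])
```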
https://arxiv.org/abs/2305.15273
Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. Our work instead dynamically schedules the masking ratio throughout training. We found that linearly decreasing the masking rate from 30% to 15% over the course of pretraining improves average GLUE accuracy by 0.46% in BERT-base, compared to a standard 15% fixed rate. Further analyses demonstrate that the gains from scheduling come from being exposed to both high and low masking rate regimes. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models and achieve up to a 1.89x speedup in pretraining.
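The schedule itself is simple enough to state in a few lines; a sketch matching the 30%-to-15% linear decay described above:

```python
def masking_rate(step: int, total_steps: int,
                 start: float = 0.30, end: float = 0.15) -> float:
    """Linearly decay the MLM masking rate from `start` to `end`
    over the course of pretraining."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

# e.g., halfway through pretraining the rate is 22.5%
assert abs(masking_rate(500, 1000) - 0.225) < 1e-9
```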
https://arxiv.org/abs/2305.15096
Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of these objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on performance in task-specific distillation. For vanilla KD and hidden-states transfer, initialization with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialization settings. We release our code as an efficient transformer-based model distillation framework for further studies.
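Attention transfer, the best-performing objective here, is commonly implemented as an MSE between student and teacher attention maps; a hedged sketch (the layer mapping below is an assumption, since the paper studies several mapping and initialization choices):

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attns, teacher_attns, layer_map):
    """MSE between student attention maps and mapped teacher maps.

    *_attns: lists of (batch, heads, seq, seq) attention tensors;
    layer_map: student layer index -> teacher layer index.
    """
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        loss = loss + F.mse_loss(student_attns[s_idx], teacher_attns[t_idx])
    return loss / len(layer_map)

# e.g., a 6-layer student distilled from a 12-layer teacher
loss = attention_transfer_loss(
    [torch.rand(2, 12, 16, 16) for _ in range(6)],
    [torch.rand(2, 12, 16, 16) for _ in range(12)],
    layer_map={i: 2 * i + 1 for i in range(6)})
```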
https://arxiv.org/abs/2305.15032
Misinformation poses a critical societal challenge, and current approaches have yet to produce an effective solution. We propose focusing on generalization, soft classification, and leveraging recent large language models to create more practical tools in contexts where perfect predictions remain unattainable. We begin by demonstrating that GPT-4 and other language models can outperform existing methods in the literature. Next, we explore their generalization, revealing that GPT-4 and RoBERTa-large exhibit critical differences in failure modes, which offer potential for significant performance improvements. Finally, we show that these models can be employed in soft classification frameworks to better quantify uncertainty. We find that models with inferior hard classification results can achieve superior soft classification performance. Overall, this research lays groundwork for future tools that can drive real-world progress on misinformation.
https://arxiv.org/abs/2305.14928
In-context learning (ICL), the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples, requires demonstrations that are informative about the test instance. The standard approach of independently selecting the most similar examples selects redundant demonstrations while overlooking important information. This work proposes a framework for assessing the informativeness of demonstrations based on their coverage of salient aspects (e.g., reasoning patterns) of the test input. Using this framework, we show that contextual token embeddings effectively capture these salient aspects, and their recall measured using BERTScore-Recall (BSR) yields a reliable measure of informativeness. Further, we extend recall metrics like BSR to propose their set versions to find maximally informative sets of demonstrations. On 6 complex compositional generation tasks and 7 diverse LLMs, we show that Set-BSR outperforms the standard similarity-based approach by up to 16% on average and, despite being learning-free, often surpasses methods that leverage task or LLM-specific training.
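The set extension can be sketched as greedy coverage maximization: each pick is rewarded for raising the best-match similarity of test tokens that earlier picks covered poorly. Using plain normalized dot products in place of full BERTScore-Recall is our simplification:

```python
import numpy as np

def greedy_set_bsr(test_embs, cand_token_embs, k=4):
    """Greedily select k demonstrations under a set version of
    BERTScore-Recall: each test token's recall is its best match over
    ALL selected demonstrations, so later picks are rewarded for
    covering tokens that earlier picks missed. Embeddings are assumed
    L2-normalized contextual token embeddings.

    test_embs: (n_test_tokens, d); cand_token_embs: list of
    (n_i_tokens, d) arrays, one per candidate demonstration.
    """
    best = np.zeros(len(test_embs))   # best similarity so far per test token
    selected, remaining = [], set(range(len(cand_token_embs)))
    for _ in range(k):
        gains, maxima = {}, {}
        for i in remaining:
            sims = test_embs @ cand_token_embs[i].T   # (n_test, n_i)
            per_token = sims.max(axis=1)              # recall per test token
            new_best = np.maximum(best, per_token)
            gains[i] = new_best.mean() - best.mean()  # marginal coverage gain
            maxima[i] = new_best
        pick = max(gains, key=gains.get)
        selected.append(pick)
        best = maxima[pick]
        remaining.remove(pick)
    return selected
```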
https://arxiv.org/abs/2305.14907
In this work, we propose a method for extracting text spans that may indicate one of the BIG5 psychological traits, framed as a question-answering task whose training data includes examples with no answer to the asked question. We utilized the RoBERTa model fine-tuned on the SQuAD 2.0 dataset. The model was further fine-tuned using comments from Reddit. We examined the effect of the percentage of no-answer examples in the training dataset on the overall performance. The results obtained in this study are in line with the SQuAD 2.0 benchmark and present a good baseline for further research.
https://arxiv.org/abs/2305.14891
Most existing stylistic text rewriting methods operate at the sentence level, but ignoring the broader context of the text can lead to generic, ambiguous, and incoherent rewrites. In this paper, we propose integrating the preceding textual context into both the rewriting and evaluation stages of stylistic text rewriting, focusing on formality, toxicity, and sentiment transfer tasks. We conduct a comparative evaluation of rewriting through few-shot prompting of GPT-3.5 and GPT NeoX, comparing non-contextual rewrites to contextual rewrites. Our experiments show that humans often prefer contextual rewrites over non-contextual ones, but automatic metrics (e.g., BLEU, sBERT) do not. To bridge this gap, we propose context-infused versions of common automatic metrics, and show that these better reflect human preferences. Overall, our paper highlights the importance of integrating preceding textual context into both the rewriting and evaluation stages of stylistic text rewriting.
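One simple way to "infuse" context into an automatic metric is to prepend the preceding context to both sides before embedding; this is an illustration of the idea, not necessarily the authors' exact recipe (the sentence-transformers checkpoint is our stand-in for sBERT):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for sBERT

def context_infused_similarity(context: str, source: str, rewrite: str) -> float:
    """Score a rewrite against its source with the preceding textual
    context prepended to both sides, so the metric sees the same
    discourse the human annotators saw."""
    a = model.encode(f"{context} {source}", convert_to_tensor=True)
    b = model.encode(f"{context} {rewrite}", convert_to_tensor=True)
    return util.cos_sim(a, b).item()
```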
https://arxiv.org/abs/2305.14755
As language models increase in size by the day, methods for efficient inference are critical to leveraging their capabilities for various applications. Prior work has investigated techniques like model pruning, knowledge distillation, and data multiplexing to increase model throughput without sacrificing accuracy. In this paper, we combine two such methods -- structured pruning and data multiplexing -- to compound the speedup gains obtained by either method. Our approach, PruMUX, obtains throughput improvements of 7.5-29.5x over the BERT-base model at accuracy thresholds ranging from 80% down to 74%. We further study various combinations of parameters (such as sparsity and multiplexing factor) in the two techniques to provide a comprehensive analysis of the tradeoff between accuracy and throughput in the resulting models. We then propose Auto-PruMUX, a meta-level model that can predict high-performing parameters for pruning and multiplexing given a desired accuracy-loss budget, providing a practical method to leverage the combination effectively.
https://arxiv.org/abs/2305.14706
Entity bias widely affects pretrained (large) language models, causing them to excessively rely on (biased) parametric knowledge to make unfaithful predictions. Although causality-inspired methods have shown great potential to mitigate entity bias, it is hard to precisely estimate the parameters of underlying causal models in practice. The rise of black-box LLMs also makes the situation even worse, because of their inaccessible parameters and uncalibrated logits. To address these problems, we propose a specific structured causal model (SCM) whose parameters are comparatively easier to estimate. Building upon this SCM, we propose causal intervention techniques to mitigate entity bias for both white-box and black-box settings. The proposed causal intervention perturbs the original entity with neighboring entities. This intervention reduces specific biasing information pertaining to the original entity while still preserving sufficient common predictive information from similar entities. When evaluated on the relation extraction task, our training-time intervention significantly improves the F1 score of RoBERTa by 5.7 points on EntRED, in which spurious shortcuts between entities and labels are removed. Meanwhile, our in-context intervention effectively reduces the knowledge conflicts between parametric knowledge and contextual knowledge in GPT-3.5 and improves the F1 score by 9.14 points on a challenging test set derived from Re-TACRED.
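The in-context intervention can be pictured as generating perturbed copies of the input in which the original entity is swapped for its neighbors, then aggregating the model's predictions over the copies; how neighbors are chosen (e.g., nearest neighbors in an embedding space) is our assumption here:

```python
def perturb_entity(sentence: str, entity: str, neighbors: list[str]) -> list[str]:
    """Replace the original entity with each neighboring entity,
    producing perturbed inputs whose predictions can be aggregated.
    This removes entity-specific shortcuts while keeping the common
    predictive signal shared by similar entities."""
    return [sentence.replace(entity, n) for n in neighbors]

variants = perturb_entity(
    "Bill Gates founded Microsoft.", "Bill Gates",
    ["Steve Jobs", "Larry Page"])
# predictions over `variants` would then be averaged (our assumption)
```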
https://arxiv.org/abs/2305.14695
Mathematical symbol definition extraction is important for improving scholarly reading interfaces and scholarly information extraction (IE). However, the task poses several challenges: math symbols are difficult to process as they are not composed of natural language morphemes, and scholarly papers often contain sentences that require resolving complex coordination structures. We present SymDef, an English-language dataset of 5,927 sentences from full-text scientific papers where each sentence is annotated with all mathematical symbols linked to their corresponding definitions. This dataset focuses specifically on complex coordination structures such as "respectively" constructions, which often contain overlapping definition spans. We also introduce a new definition extraction method that masks mathematical symbols, creates a copy of each sentence for each symbol, specifies a target symbol, and predicts its corresponding definition spans using slot filling. Our experiments show that our definition extraction model significantly outperforms RoBERTa and other strong IE baseline systems by 10.9 points, with a macro F1 score of 84.82. With our dataset and model, we can detect complex definitions in scholarly documents to make scientific writing more readable.
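The input construction lends itself to a short sketch: mask every symbol, emit one copy of the sentence per symbol, and tag the current target so a slot-filling model can predict that symbol's definition span. The marker tokens below are our assumptions:

```python
import re

def make_symbol_queries(sentence: str, symbols: list[str]):
    """For each symbol, produce a copy of the sentence with all symbols
    masked and the current target marked, ready for a slot-filling
    model to predict the target's definition span."""
    copies = []
    for target in symbols:
        masked = sentence
        for s in symbols:
            marker = "[TARGET]" if s == target else "[SYM]"
            masked = re.sub(rf"\b{re.escape(s)}\b", marker, masked)
        copies.append((target, masked))
    return copies

for tgt, text in make_symbol_queries(
        "Let x and y denote the input and output, respectively.", ["x", "y"]):
    print(tgt, "->", text)
# x -> Let [TARGET] and [SYM] denote the input and output, respectively.
# y -> Let [SYM] and [TARGET] denote the input and output, respectively.
```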
https://arxiv.org/abs/2305.14660
Given the success of Graph Neural Networks (GNNs) for structure-aware machine learning, numerous studies have explored their application to text classification as an alternative to traditional feature representation models. However, most studies considered only a specific domain and validated on data with particular characteristics. This work presents an extensive empirical investigation of graph-based text representation methods proposed for text classification, identifying practical implications and open challenges in the field. We compare several GNN architectures as well as BERT across five datasets, encompassing both short and long documents. The results show that: i) graph performance is highly related to the textual input features and domain; ii) despite its outstanding performance, BERT has difficulty converging when dealing with short texts; iii) graph methods are particularly beneficial for longer documents.
https://arxiv.org/abs/2305.14578
Transformer models have brought sweeping advances to various NLP tasks, prompting extensive interpretability research on the learned representations of these models. However, we raise a fundamental question regarding the reliability of these representations. Specifically, we investigate whether transformers learn essentially isomorphic representation spaces, or ones that are sensitive to the random seeds in their pretraining process. In this work, we formulate the Bijection Hypothesis, which suggests the use of bijective methods to align different models' representation spaces. We propose a model based on invertible neural networks, BERT-INN, to learn the bijection more effectively than other existing bijective methods such as canonical correlation analysis (CCA). We show the advantage of BERT-INN both theoretically and through extensive experiments, and apply it to align reproduced BERT embeddings to draw insights that are meaningful to interpretability research. Our code is at this https URL.
https://arxiv.org/abs/2305.14555
Machine learning models pre-trained on large datasets have achieved remarkable convergence and robustness properties. However, these models often exploit spurious correlations between certain attributes and labels, which are prevalent in the majority of examples within specific categories but are not predictive of these categories in general. The learned spurious correlations may persist even after fine-tuning on new data, which degrades models' performance on examples that do not exhibit the spurious correlation. In this work, we propose a simple and highly effective method to eliminate spurious correlations from pre-trained models. The key idea of our method is to leverage a small set of examples with spurious attributes, and balance the spurious attributes across all classes via data mixing. We theoretically confirm the effectiveness of our method, and empirically demonstrate its state-of-the-art performance on various vision and NLP tasks, including eliminating spurious correlations from pre-trained ResNet50 on Waterbirds and CelebA, adversarially pre-trained ResNet50 on ImageNet, and BERT pre-trained on CivilComments.
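One plausible reading of "balancing spurious attributes via data mixing" is group-balanced resampling over (label, attribute) cells, sketched below with illustrative data structures; the paper's actual mixing procedure may differ:

```python
import random
from collections import defaultdict

def balance_by_mixing(examples, n_per_cell=None):
    """Resample so every spurious attribute appears equally often
    within every class, breaking the attribute-label correlation
    before fine-tuning. `examples` is a list of (x, label, attribute)
    triples, where only a small annotated set is assumed available."""
    cells = defaultdict(list)
    for x, y, a in examples:
        cells[(y, a)].append((x, y, a))
    n = n_per_cell or min(len(group) for group in cells.values())
    balanced = []
    for group in cells.values():
        balanced.extend(random.choices(group, k=n))  # sample with replacement
    random.shuffle(balanced)
    return balanced

# toy Waterbirds-style setup: label (bird type) correlates with background
data = [("img%d" % i, i % 2, "water" if i % 3 else "land") for i in range(30)]
balanced = balance_by_mixing(data)
```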
https://arxiv.org/abs/2305.14521
Large Language Models (LLMs) have exhibited remarkable performance across various natural language processing (NLP) tasks. However, fine-tuning these models often necessitates substantial supervision, which can be expensive and time-consuming to obtain. This paper introduces a novel unsupervised method called Language Model Self-Improvement by Reinforcement Learning Contemplation (SIRLC) that improves LLMs without reliance on external labels. Our approach is grounded in the observation that it is simpler for language models to assess text quality than to generate text. Building on this insight, SIRLC assigns LLMs dual roles as both student and teacher. As a student, the LLM generates answers to unlabeled questions, while as a teacher, it evaluates the generated text and assigns scores accordingly. The model parameters are updated using reinforcement learning to maximize the evaluation score. We demonstrate that SIRLC can be applied to various NLP tasks, such as reasoning problems, text generation, and machine translation. Our experiments show that SIRLC effectively improves LLM performance without external supervision, resulting in a 5.6% increase in answering accuracy for reasoning tasks and a rise in BERTScore from 0.82 to 0.86 for translation tasks. Furthermore, SIRLC can be applied to models of different sizes, showcasing its broad applicability.
https://arxiv.org/abs/2305.14483
Transformer-based pretrained models like BERT, GPT-2, and T5 have been finetuned for a large number of natural language processing (NLP) tasks and have been shown to be very effective. However, what changes across layers in these models during finetuning, relative to their pretrained checkpoints, is under-studied. Further, how robust are these models to perturbations in input text? Does the robustness vary depending on the NLP task for which the models have been finetuned? While there exists some work on studying the robustness of BERT finetuned for a few NLP tasks, there is no rigorous study comparing this robustness across encoder-only, decoder-only, and encoder-decoder models. In this paper, we study the robustness of three language models (BERT, GPT-2, and T5) under eight different text perturbations on the General Language Understanding Evaluation (GLUE) benchmark. We also use two metrics (CKA and STIR) to quantify changes between pretrained and finetuned language model representations across layers. GPT-2 representations are more robust than BERT and T5 across multiple types of input perturbation. Although the models exhibit good robustness broadly, dropping nouns, dropping verbs, or changing characters are the most impactful perturbations. Overall, this study provides valuable insights into perturbation-specific weaknesses of popular Transformer-based models that should be kept in mind when constructing inputs.
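Of the two metrics, linear CKA has a compact closed form; a sketch for comparing pretrained and finetuned layer representations:

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear Centered Kernel Alignment between two representation
    matrices of shape (n_examples, dim), e.g., the same layer's
    activations before and after finetuning on the same inputs."""
    X = X - X.mean(0, keepdim=True)   # center features
    Y = Y - Y.mean(0, keepdim=True)
    xty = (X.T @ Y).norm() ** 2       # ||X^T Y||_F^2
    xtx = (X.T @ X).norm()            # ||X^T X||_F
    yty = (Y.T @ Y).norm()            # ||Y^T Y||_F
    return (xty / (xtx * yty)).item()

sim = linear_cka(torch.randn(128, 768), torch.randn(128, 768))
```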
https://arxiv.org/abs/2305.14453
Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific instructions. However, due to the computational demands associated with training these models, their applications often rely on zero-shot settings. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiment considers the impact of prompt complexity, including the effect of incorporating label definitions into the prompt, using synonyms for label names, and the influence of integrating past memories during the foundation model training. The findings indicate that in a zero-shot setting, the current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.
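The prompt-complexity axes the paper varies (label definitions, label-name synonyms) are easy to sketch; the wording below is our own, not the authors':

```python
def build_prompt(text, labels, definitions=None, synonyms=None):
    """Assemble a zero-shot classification prompt, optionally swapping
    label names for synonyms and/or appending label definitions --
    the two prompting strategies being compared."""
    names = [(synonyms or {}).get(l, l) for l in labels]
    prompt = f"Classify the following text as one of: {', '.join(names)}.\n"
    if definitions:
        for l, name in zip(labels, names):
            prompt += f"- {name}: {definitions[l]}\n"
    return prompt + f"Text: {text}\nLabel:"

print(build_prompt("I love this!", ["positive", "negative"],
                   definitions={"positive": "expresses approval",
                                "negative": "expresses disapproval"}))
```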
https://arxiv.org/abs/2305.14310