Paper Reading AI Learner

Input-length-shortening and text generation via attention values

2023-03-14 02:11:24
Neşet Özkan Tan, Alex Yuxuan Peng, Joshua Bensemann, Qiming Bao, Tim Hartill, Mark Gahegan, Michael Witbrock

Abstract

Identifying words that impact a task's performance more than others is a challenge in natural language processing. Transformer models have recently addressed this issue by incorporating an attention mechanism that assigns greater attention (i.e., relevance) scores to some words than to others. Because of the attention mechanism's high computational cost, transformer models usually have an input-length limitation caused by hardware constraints. This limitation applies to many transformers, including the well-known Bidirectional Encoder Representations from Transformers (BERT) model. In this paper, we examined BERT's attention assignment mechanism, focusing on two questions: (1) How can attention be employed to reduce input length? (2) How can attention be used as a control mechanism for conditional text generation? We investigated these questions in the context of a text classification task. We found that BERT's early layers assign more critical attention scores for text classification tasks than its later layers. We demonstrated that the first layer's attention sums can be used to filter tokens in a given sequence, considerably decreasing the input length while maintaining good test accuracy. We also applied a filtering method based on a computationally efficient semantic-similarity algorithm and found that retaining approximately 6% of the original sequence is sufficient to obtain 86.5% accuracy. Finally, we showed that we could stably generate data that is indistinguishable from the original by using only a small percentage (10%) of the tokens with the highest attention scores according to BERT's first layer.
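
Below is a minimal sketch of the token-filtering idea the abstract describes: score each token by the total attention it receives in BERT's first layer, then keep only the highest-scoring fraction of the sequence. This is not the authors' code; the Hugging Face transformers library, the bert-base-uncased checkpoint, the 10% keep ratio, and the head/query aggregation are illustrative assumptions, not details taken from the paper.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def filter_by_first_layer_attention(text, keep_ratio=0.10):
    """Keep the tokens that receive the most first-layer attention (hypothetical helper)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
    first_layer = outputs.attentions[0][0]        # (heads, seq, seq) for this single example
    # Attention each token receives, summed over heads and query positions.
    scores = first_layer.sum(dim=0).sum(dim=0)    # shape: (seq,)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    k = max(1, int(len(tokens) * keep_ratio))
    keep = torch.topk(scores, k).indices.sort().values  # restore original token order
    return [tokens[i] for i in keep.tolist()]

print(filter_by_first_layer_attention(
    "The movie started slowly, but the final act was genuinely thrilling."))

In practice one would likely force-keep special tokens such as [CLS] and [SEP] before feeding the shortened sequence to a classifier, and tune the keep ratio against test accuracy, in the spirit of the 6% and 10% figures reported in the abstract.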

URL

https://arxiv.org/abs/2303.07585

PDF

https://arxiv.org/pdf/2303.07585.pdf

