Abstract
Past work on interpretability in natural language processing has focused mainly on popular classification tasks while largely overlooking generation settings, partly due to a lack of dedicated tools. In this work, we introduce Inseq, a Python library to democratize access to interpretability analyses of sequence generation models. Inseq enables intuitive and optimized extraction of models' internal information and feature importance scores for popular decoder-only and encoder-decoder Transformer architectures. We showcase its potential by applying it to highlight gender biases in machine translation models and to locate factual knowledge inside GPT-2. Thanks to its extensible interface supporting cutting-edge techniques such as contrastive feature attribution, Inseq can drive future advances in explainable natural language generation, centralizing good practices and enabling fair and reproducible model evaluations.
URL
https://arxiv.org/abs/2302.13942