Abstract
The Transformer architecture has achieved great success in multiple applied machine learning communities, such as natural language processing (NLP), computer vision (CV), and information retrieval (IR). Its core mechanism, attention, requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference. Many works have been proposed to improve the scalability of the attention mechanism, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms to replace attention. Recently, a notable model structure, Mamba, which is based on state space models, has achieved Transformer-equivalent performance in multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task: document ranking. A reranker model takes a query and a document as input and predicts a scalar relevance score. This task demands that the language model comprehend lengthy contextual inputs and capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to Transformer-based models with the same training recipe; (2) but they also have lower training throughput than efficient Transformer implementations such as Flash Attention. We hope this study can serve as a starting point for exploring Mamba models in other classical IR tasks. Our code implementation and trained checkpoints are made public to facilitate reproducibility.
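The reranking interface described in the abstract (a query and a document go in, a scalar relevance score comes out, and candidates are sorted by that score) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `overlap_score` is a hypothetical stand-in based on token overlap, used only so the example runs without a trained model. In the setting studied here, the scoring function would be a Transformer- or Mamba-based language model.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, document: str) -> float:
    """Hypothetical stand-in scorer (an assumption, not the paper's model).

    Returns the fraction of query tokens that also occur in the document.
    A real reranker would replace this with a Transformer- or Mamba-based
    language model that reads the query and the document together.
    """
    query_tokens = query.lower().split()
    document_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    hits = sum(1 for token in query_tokens if token in document_tokens)
    return hits / len(query_tokens)


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and sort by descending relevance."""
    scored = [(document, score_fn(query, document)) for document in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        "transformers use attention",
        "mamba is built on state space models",
        "space exploration",
    ]
    for document, score in rerank("state space models", candidates, overlap_score):
        print(f"{score:.3f}  {document}")
```

Keeping the scorer behind a plain `score_fn` callable means the same ranking loop can be reused when the stand-in is swapped for a learned model.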
Code (link as listed in the translated abstract)
https://github.com/jiexuanzeng/transformer-IR
URL
https://arxiv.org/abs/2403.18276