Paper Reading AI Learner

RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers

2024-03-27 06:07:05
Zhichao Xu

Abstract

The transformer architecture has achieved great success in multiple applied machine learning communities, such as natural language processing (NLP), computer vision (CV), and information retrieval (IR). The core mechanism of the transformer architecture, attention, requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference. Many works have been proposed to improve the attention mechanism's scalability, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms to replace attention altogether. Recently, a notable model structure, Mamba, which is based on state space models, has achieved transformer-equivalent performance on multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task: document ranking. A reranker model takes a query and a document as input and predicts a scalar relevance score. This task demands the language model's ability to comprehend lengthy contextual inputs and to capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to transformer-based models trained with the same recipe, (2) but have lower training throughput than efficient transformer implementations such as Flash Attention. We hope this study can serve as a starting point for exploring Mamba models in other classical IR tasks. Our code implementation and trained checkpoints are made public to facilitate reproducibility (this https URL).
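The reranking setup described in the abstract is the standard cross-encoder formulation: the query and a candidate document are fed jointly to the language model, which emits a single relevance score used to reorder the candidates. Below is a minimal illustrative sketch of that scoring loop, assuming an off-the-shelf Hugging Face cross-encoder checkpoint rather than the paper's Mamba or transformer models (whose exact names and training recipe are not reproduced here).

```python
# Minimal cross-encoder reranking sketch (illustrative only; not the paper's implementation).
# The checkpoint name below is an assumed off-the-shelf reranker, not one of the paper's models.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "what is a state space model"
documents = [
    "State space models describe a system with latent states that evolve over time.",
    "Flash Attention is an IO-aware exact attention algorithm for GPUs.",
]

# Pair the query with each candidate document; the model predicts one scalar relevance logit per pair.
features = tokenizer([query] * len(documents), documents,
                     padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**features).logits.squeeze(-1)

# Rerank candidates by descending relevance score.
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}\t{doc}")
```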

URL

https://arxiv.org/abs/2403.18276

PDF

https://arxiv.org/pdf/2403.18276.pdf
