Paper Reading AI Learner

Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

2026-02-07 06:27:11
Hoang Anh Duy Le (Rice University), Sahil Joshi (Rice University), Zeyu Yang (Rice University), Zhaozhuo Xu (Stevens Institute of Technology), Anshumali Shrivastava (Rice University)

Abstract

Self-attention dominates the computational and memory cost of long-context LLM inference in both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity using lightweight sketches and a deterministic walk. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct token-to-token interactions. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases, together with custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density, can slightly outperform dense attention in some settings, and achieves up to 6x inference speedup.
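The paper does not spell out the algorithm in this abstract, but the pipeline it describes — cheap sketched attention-score estimates, cross-layer "walk" accumulation, and top-k block selection — can be illustrated with a rough NumPy sketch. Everything below is an assumption for illustration: the function names, the randomized-Hadamard projection details, the per-block mean-pooling of sketched logits, and the exponential `decay` used for cross-layer accumulation are not taken from the paper.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def make_sketch(d, sketch_dim, rng):
    # Shared randomized-Hadamard projection (hypothetical form). Using the
    # same projection for Q and K roughly preserves their dot products.
    H = hadamard(d) / np.sqrt(d)                      # orthonormal Hadamard
    signs = rng.choice([-1.0, 1.0], size=d)           # random sign flips
    idx = rng.choice(d, size=sketch_dim, replace=False)
    # Column-subsampled projection, rescaled so inner products are ~unbiased.
    return (signs[:, None] * H)[:, idx] * np.sqrt(d / sketch_dim)

def block_scores(q, k, block, P):
    # Approximate attention logits from sketched Q/K, then pool one score
    # per (query-block, key-block) pair. Mean-pooling is an assumption.
    s = (q @ P) @ (k @ P).T / np.sqrt(q.shape[-1])
    nb = q.shape[0] // block
    return s.reshape(nb, block, nb, block).mean(axis=(1, 3))

def walk_accumulate(prev, cur, decay=0.5):
    # Cross-layer "walk" aggregation sketched as an exponential blend of the
    # accumulated score with the current layer's estimate (assumed form).
    return decay * prev + (1 - decay) * cur

def topk_blocks(scores, k):
    # Indices of the k highest-scoring key blocks for each query block;
    # only these blocks would be computed by the sparse attention kernel.
    return np.argsort(scores, axis=-1)[:, -k:]
```

At 20% density as reported in the abstract, `k` would be chosen so that `k * block` covers roughly a fifth of the key positions per query block.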

URL

https://arxiv.org/abs/2602.07397

PDF

https://arxiv.org/pdf/2602.07397.pdf
