Abstract
Self-attention dominates the computational and memory cost of long-context LLM inference in both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and a deterministic walk. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct token-to-token interactions. The accumulated walk scores are used to select the top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases, backed by custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density, can slightly outperform dense attention in some settings, and achieves up to a 6x inference speedup.
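The pipeline the abstract describes — sketch the attention logits cheaply, accumulate scores across layers with a walk, then keep the top-k blocks — can be illustrated with a minimal NumPy sketch. All function names, the max-pooling into blocks, and the rollout-style walk update below are illustrative assumptions, not the paper's actual implementation or kernels.

```python
import numpy as np

def hadamard_matrix(d):
    """Sylvester construction of a d x d Hadamard matrix (d a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

def make_sketch(d, m, rng):
    """SRHT-style sketch: random sign flips, Hadamard transform, subsample m
    coordinates. Sketched dot products (x @ S) @ (y @ S).T are unbiased
    estimates of the true x @ y.T, so attention logits can be approximated
    in the m-dimensional sketch space."""
    H = hadamard_matrix(d) / np.sqrt(d)
    signs = rng.choice([-1.0, 1.0], size=d)
    cols = rng.choice(d, size=m, replace=False)
    return (signs[:, None] * H)[:, cols] * np.sqrt(d / m)  # d x m

def approx_block_scores(Q, K, S, block):
    """Cheap attention-logit estimates from sketched Q and K,
    max-pooled over (block x block) tiles."""
    assert Q.shape[0] % block == 0 and K.shape[0] % block == 0
    logits = (Q @ S) @ (K @ S).T / np.sqrt(Q.shape[-1])
    nq, nk = Q.shape[0] // block, K.shape[0] // block
    return logits.reshape(nq, block, nk, block).max(axis=(1, 3))

def walk_accumulate(prev, cur_logits, alpha=0.5):
    """One step of a rollout-style walk: blend this layer's row-normalized
    block scores with scores propagated through earlier layers, so influence
    beyond direct token interactions is captured. The exact aggregation rule
    is an assumption here."""
    A = np.exp(cur_logits - cur_logits.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A if prev is None else alpha * A + (1 - alpha) * (prev @ A)

def topk_block_mask(scores, density):
    """Keep the highest-scoring `density` fraction of key blocks per query
    block; the resulting boolean mask would drive a sparse attention kernel."""
    k = max(1, int(density * scores.shape[-1]))
    keep = np.argsort(scores, axis=-1)[:, -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask
```

At the paper's reported 20% attention density, roughly one in five key blocks would survive per query block, and dense attention would then run only on the surviving tiles.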
URL
https://arxiv.org/abs/2602.07397