Paper Reading AI Learner

Context-Aware Token Selection and Packing for Enhanced Vision Transformer

2024-10-31 03:47:27
Tianyi Zhang, Baoxin Li, Jae-sun Seo, Yu Cao

Abstract

In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, the traditional self-attention mechanism, which processes informative and non-informative tokens alike, suffers from inefficiency and inaccuracies. While sparse attention mechanisms have been introduced to mitigate these issues by pruning the tokens involved in attention, they often lack context awareness: they frequently apply a uniform token selection strategy across different inputs to enable batch training, or optimize efficiency only at the inference stage. To overcome these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, enabling a variable number of tokens per sample in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks demonstrate that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational costs.
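The abstract's select-and-pack idea can be illustrated with a minimal sketch: a cheap per-token gating layer scores each token, tokens above a threshold survive (a variable number per sample), and the survivors from the whole batch are packed into one flat buffer with sample indices so downstream attention can still run as a single batched operation. The function names, the sigmoid gate, and the threshold below are illustrative assumptions, not the paper's actual implementation (which uses label-supervised gating).

```python
import numpy as np

def gate_scores(tokens, w, b=0.0):
    # Low-cost gating layer (assumed form): one linear projection
    # per token followed by a sigmoid, giving a score in (0, 1).
    z = tokens @ w + b                       # shape [B, N]
    return 1.0 / (1.0 + np.exp(-z))

def select_and_pack(tokens, scores, threshold=0.5):
    """Keep tokens whose gate score exceeds `threshold`, then pack the
    variable-length survivors from all samples into one flat buffer,
    recording which sample each packed token came from so a batched
    kernel can still process them in parallel."""
    packed, sample_idx = [], []
    for i, (toks, s) in enumerate(zip(tokens, scores)):
        keep = s > threshold                 # boolean mask, variable count
        packed.append(toks[keep])
        sample_idx.extend([i] * int(keep.sum()))
    return np.concatenate(packed, axis=0), np.array(sample_idx)

# Toy example: 2 samples, 5 tokens each, 4-dim embeddings.
rng = np.random.default_rng(0)
B, N, D = 2, 5, 4
tokens = rng.normal(size=(B, N, D))
w = rng.normal(size=(D,))                    # gating weights (random here)
scores = gate_scores(tokens, w)
packed, idx = select_and_pack(tokens, scores)
```

The packing step is what makes a variable token count compatible with GPU batching: instead of padding every sample to the same length, all kept tokens share one dense tensor and the index vector recovers sample membership.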

URL

https://arxiv.org/abs/2410.23608

PDF

https://arxiv.org/pdf/2410.23608.pdf
