Paper Reading AI Learner

Sparse Transformer for Ultra-sparse Sampled Video Compressive Sensing

2025-09-10 02:03:06
Miao Cao, Siming Zheng, Lishun Wang, Ziyang Chen, David Brady, Xin Yuan

Abstract

Digital cameras consume ~0.1 microjoule per pixel to capture and encode video, which amounts to a power draw of ~20 W for a 4K sensor operating at 30 fps. For gigapixel cameras operating at 100-1000 fps, this processing model is unsustainable. To address this, physical-layer compressive measurement has been proposed to reduce the power consumption per pixel by 10-100X. Video Snapshot Compressive Imaging (SCI) introduces high-frequency modulation in the optical sensor layer to increase the effective frame rate. A commonly used sampling strategy in video SCI is Random Sampling (RS), where each mask element is randomly set to 0 or 1. Similarly, image inpainting (I2P) has demonstrated that images can be recovered from a fraction of their pixels. Inspired by I2P, we propose the Ultra-Sparse Sampling (USS) regime, in which, at each spatial location, only one sub-frame is set to 1 and all others are set to 0. We then build a Digital Micromirror Device (DMD) encoding system to verify the effectiveness of our USS strategy. Ideally, the USS measurement can be decomposed into sub-measurements, to which I2P algorithms can be applied to recover the high-speed frames. However, due to the mismatch between the DMD and the CCD, the USS measurement cannot be perfectly decomposed. To this end, we propose BSTFormer, a sparse Transformer that utilizes local Block attention, global Sparse attention, and global Temporal attention to exploit the sparsity of the USS measurement. Extensive results on both simulated and real-world data show that our method significantly outperforms all previous state-of-the-art algorithms. Additionally, an essential advantage of the USS strategy is its higher dynamic range compared with the RS strategy. Finally, from an application perspective, the USS strategy is a good choice for implementing a complete video SCI system on a chip, owing to its fixed exposure time.
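To make the two sampling regimes concrete, here is a minimal NumPy sketch, not the authors' code: the array names, sizes, and the assumed snapshot forward model (the measurement is the mask-weighted sum of the high-speed sub-frames) are illustrative assumptions drawn from the abstract. It contrasts RS and USS mask generation and shows why an ideal USS measurement splits into per-frame sub-measurements that an inpainting (I2P) method could then complete.

```python
# Minimal sketch of RS vs. USS mask generation and the snapshot measurement.
# Names, shapes, and the forward model are assumptions, not the paper's code.
import numpy as np

T, H, W = 8, 64, 64                    # number of high-speed sub-frames, spatial size
rng = np.random.default_rng(0)
video = rng.random((T, H, W))          # stand-in for the high-speed scene x_1..x_T

# Random Sampling (RS): every mask element is independently set to 0 or 1.
rs_masks = rng.integers(0, 2, size=(T, H, W)).astype(float)

# Ultra-Sparse Sampling (USS): at each spatial location exactly one sub-frame
# is set to 1 and all others to 0 (a one-hot code along the time axis).
active = rng.integers(0, T, size=(H, W))          # which sub-frame each pixel keeps
uss_masks = np.zeros((T, H, W))
uss_masks[active, np.arange(H)[:, None], np.arange(W)[None, :]] = 1.0

def snapshot(masks, frames):
    """Single compressive measurement: mask each sub-frame element-wise, then sum over time."""
    return (masks * frames).sum(axis=0)

y_rs = snapshot(rs_masks, video)
y_uss = snapshot(uss_masks, video)

# Ideal USS decomposition: because the masks are one-hot in time, the single
# measurement splits into T disjoint sub-measurements; sub_meas[t] holds the
# pixels sampled from sub-frame t (zeros elsewhere), i.e. an inpainting target.
sub_meas = uss_masks * y_uss
assert np.allclose(sub_meas.sum(axis=0), y_uss)
```

In a real system, the abstract notes, this clean split breaks down because of the DMD-to-CCD mismatch, which is what motivates learning the reconstruction with BSTFormer instead.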

Abstract (translated)

A digital camera consumes roughly 0.1 microjoule of energy to capture and encode each video pixel, which means a power consumption of about 20 W for a 4K sensor running at 30 fps. If we imagine gigapixel cameras operating at 100 to 1000 fps, the existing processing model is clearly unsustainable. To address this problem, compressive measurement at the physical layer has been proposed to reduce the energy consumption per pixel by a factor of 10 to 100. Video Snapshot Compressive Imaging (SCI) raises the effective frame rate by introducing high-frequency modulation at the optical sensor layer. A common sampling strategy for video SCI is Random Sampling (RS), in which each mask element is randomly set to 0 or 1. Similarly, image inpainting (I2P) has shown that a complete image can be recovered from only a fraction of its pixels. Inspired by I2P, we propose an Ultra-Sparse Sampling (USS) strategy, under which only one sub-frame is set to 1 at each spatial location while all others are set to 0. We then build an encoding system based on a Digital Micromirror Device (DMD) to verify the effectiveness of the USS strategy. Ideally, the USS measurement can be decomposed into sub-measurements, and I2P algorithms can be used to recover the high-speed frames. However, due to the mismatch between the DMD and the CCD, the USS measurement cannot be perfectly decomposed. To this end, we propose BSTFormer (Block, Sparse, Temporal Transformer), a sparse Transformer that uses local block attention, global sparse attention, and global temporal attention to exploit the sparsity of the USS measurement. Extensive experimental results on both simulated and real-world datasets show that our method significantly outperforms all previous state-of-the-art algorithms. In addition, a key advantage of the USS strategy is its larger dynamic range compared with the RS strategy. From an application perspective, its fixed exposure time also makes the USS strategy a good choice for implementing a complete video SCI system on a chip.
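As a rough illustration of the three attention scopes named in the abstract (local Block, global Sparse, global Temporal), the sketch below shows one plausible way to partition a feature volume for each scope. This is an assumption-laden toy, not BSTFormer itself: the window size, the choice of block-mean tokens for the global sparse stage, and the plain softmax attention are all placeholders.

```python
# Toy sketch (not the authors' code) of how the three attention scopes in
# BSTFormer could partition a feature volume of shape (T, H, W, C).
# The block size and token groupings below are illustrative assumptions.
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention over the token axis (second to last)."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, H, W, C, B = 8, 16, 16, 32, 4                 # B x B local spatial blocks
x = np.random.default_rng(1).random((T, H, W, C))

# 1) Local Block attention: tokens attend only within a B x B spatial window.
blocks = x.reshape(T, H // B, B, W // B, B, C).transpose(0, 1, 3, 2, 4, 5)
blocks = blocks.reshape(T, (H // B) * (W // B), B * B, C)   # (..., tokens, C)
local_out = attention(blocks, blocks, blocks)

# 2) Global Sparse attention: one summary token per block (here, its mean)
# attends to every other block token across the whole frame.
block_tokens = blocks.mean(axis=2)                          # (T, nBlocks, C)
sparse_out = attention(block_tokens, block_tokens, block_tokens)

# 3) Global Temporal attention: each spatial location attends across the
# T sub-frames from which it was (sparsely) sampled.
temporal = x.transpose(1, 2, 0, 3).reshape(H * W, T, C)
temporal_out = attention(temporal, temporal, temporal)
```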

URL

https://arxiv.org/abs/2509.08228

PDF

https://arxiv.org/pdf/2509.08228.pdf

