
DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

2025-06-05 09:53:44
Shuo Cao, Yihao Liu, Xiaohui Li, Yuanting Gao, Yu Zhou, Chao Dong

Abstract

Transformer-based models like ViViT and TimeSformer have advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the power of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, directly applying these models to real-world video super-resolution (VSR) is challenging, as VSR demands pixel-level precision, which can be compromised by tokenization and sequential attention mechanisms. While recent transformer-based VSR models attempt to address these issues using smaller patches and local attention, they still face limitations such as restricted receptive fields and dependence on optical-flow-based alignment, which can introduce inaccuracies in real-world settings. To overcome these issues, we propose the Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution (DualX-VSR), which introduces a novel dual axial spatial$\times$temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. DualX-VSR eliminates the need for motion compensation, offering a simplified structure that provides a cohesive representation of spatiotemporal information. As a result, DualX-VSR achieves high fidelity and superior performance in real-world VSR tasks.
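
The abstract does not spell out how attention "along orthogonal directions" is computed, so the following is only a minimal sketch under one plausible reading: attention runs within each time-by-height plane (one plane per column) and within each time-by-width plane (one plane per row), and the two results are averaged. The function names, the absence of learned query/key/value projections, and the averaging step are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def plane_attention(tokens):
    """Scaled dot-product self-attention over tokens of shape (planes, n, c).

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), to keep the sketch short.
    """
    c = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(c)
    return softmax(scores, axis=-1) @ tokens


def dual_axial_attention(video):
    """Hypothetical dual axial attention over a video of shape (T, H, W, C)."""
    T, H, W, C = video.shape

    # Axis 1: for each column w, attend jointly over the T*H tokens
    # of its time-by-height plane.
    th = video.transpose(2, 0, 1, 3).reshape(W, T * H, C)
    th_out = plane_attention(th).reshape(W, T, H, C).transpose(1, 2, 0, 3)

    # Axis 2: for each row h, attend jointly over the T*W tokens
    # of its time-by-width plane.
    tw = video.transpose(1, 0, 2, 3).reshape(H, T * W, C)
    tw_out = plane_attention(tw).reshape(H, T, W, C).transpose(1, 0, 2, 3)

    # Assumption: the two axial branches are fused by simple averaging.
    return 0.5 * (th_out + tw_out)


rng = np.random.default_rng(0)
video = rng.standard_normal((3, 4, 5, 8))  # T=3 frames, 4x5 pixels, 8 channels
out = dual_axial_attention(video)
print(out.shape)
```

In this sketch, a single layer lets each output position see every frame along its own row and column without any explicit alignment step, which is in the spirit of the abstract's claim of avoiding motion compensation. Positions that share neither a row nor a column would only interact after stacking several such layers.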

Abstract (translated)

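Transformer-based models such as ViViT and TimeSformer have greatly advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the strength of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, applying these models directly to real-world video super-resolution (VSR) is challenging, because VSR requires pixel-level precision, which tokenization and sequential attention mechanisms can undermine. Although recent transformer-based VSR models try to address these problems with smaller patches and local attention, they still suffer from limitations such as restricted receptive fields and a reliance on optical-flow alignment, which can introduce inaccuracies in real-world settings.

To address these challenges, we propose the Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution (DualX-VSR). The model introduces a novel dual axial spatial$\times$temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. In this way, DualX-VSR eliminates the need for motion compensation and offers a simplified structure that gives a coherent representation of spatiotemporal information. As a result, DualX-VSR achieves high fidelity and superior performance on real-world VSR tasks.

In short, DualX-VSR improves video super-resolution by introducing a novel spatial$\times$temporal attention mechanism, which not only handles spatiotemporal information more effectively but also simplifies the structure and avoids the errors introduced by motion compensation. This allows the model to deliver higher accuracy and better performance in real-world applications.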

URL

https://arxiv.org/abs/2506.04830

PDF

https://arxiv.org/pdf/2506.04830.pdf

