YODA: Yet Another One-step Diffusion-based Video Compressor

2026-01-03 10:12:07
Xingchen Li, Junzhe Zhang, Junqi Shi, Ming Lu, Zhan Ma

Abstract

While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA (Yet Another One-step Diffusion-based Video Compressor), which embeds multiscale features from temporal references for both latent generation and latent coding to better exploit spatial-temporal correlations for more compact representation, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at this https URL.

URL

https://arxiv.org/abs/2601.01141

PDF

https://arxiv.org/pdf/2601.01141.pdf

