Abstract
While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA (Yet Another One-step Diffusion-based Video Compressor), which embeds multiscale features from temporal references into both latent generation and latent coding to better exploit spatiotemporal correlations for a more compact representation, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at this https URL.
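The abstract names two ingredients: multiscale features extracted from a temporal reference, and a single-pass denoiser conditioned on them. Below is a minimal, hypothetical PyTorch sketch of that idea. All module names, tensor shapes, and the use of standard cross-attention (standing in for the paper's linear-attention DiT) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: condition a one-step denoiser on multiscale
# features from a temporal reference frame. Shapes and modules are
# assumptions for illustration only.
import torch
import torch.nn as nn

class RefFeaturePyramid(nn.Module):
    """Extracts multiscale features from a reference-frame latent."""
    def __init__(self, ch: int = 64, levels: int = 3):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.Conv2d(ch, ch, kernel_size=3, stride=2, padding=1)
             for _ in range(levels)]
        )

    def forward(self, ref_latent: torch.Tensor) -> list[torch.Tensor]:
        feats, x = [], ref_latent
        for stage in self.stages:
            x = torch.relu(stage(x))   # halve spatial size at each level
            feats.append(x)
        return feats

class OneStepDenoiser(nn.Module):
    """Single denoising pass whose tokens cross-attend to reference features."""
    def __init__(self, ch: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.proj = nn.Linear(ch, ch)

    def forward(self, noisy: torch.Tensor,
                ref_feats: list[torch.Tensor]) -> torch.Tensor:
        b, c, h, w = noisy.shape
        q = noisy.flatten(2).transpose(1, 2)            # (B, H*W, C) tokens
        kv = torch.cat(                                  # all scales as one
            [f.flatten(2).transpose(1, 2) for f in ref_feats], dim=1)
        fused, _ = self.attn(q, kv, kv)                  # attend to reference
        out = self.proj(q + fused)                       # one denoising pass
        return out.transpose(1, 2).reshape(b, c, h, w)

# Usage: denoise the current frame's latent in a single step,
# conditioned on the previous (reference) frame's latent.
ref, noisy = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
pyramid, denoiser = RefFeaturePyramid(), OneStepDenoiser()
recon = denoiser(noisy, pyramid(ref))
print(recon.shape)  # torch.Size([1, 64, 32, 32])
```

In a full codec the same reference features would presumably also condition the entropy model for latent coding; the sketch shows only the generation-side conditioning the abstract describes.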
URL
https://arxiv.org/abs/2601.01141