Abstract
Implicit neural representations (INRs) excel at encoding videos within neural networks and show promise in computer vision tasks such as video compression and denoising. INR-based approaches reconstruct video frames from content-agnostic embeddings, which hampers their efficacy in video frame regression and restricts their generalization ability for video interpolation. To address these deficiencies, Hybrid Neural Representation for Videos (HNeRV) was introduced with content-adaptive embeddings. Nevertheless, HNeRV's compression ratios remain relatively low because it does not leverage the network's shallow features or inter-frame residual information. In this work, we introduce an advanced U-shaped architecture, Vector Quantized-NeRV (VQ-NeRV), which integrates a novel component, the VQ-NeRV Block. This block incorporates a codebook mechanism to discretize the network's shallow residual features and inter-frame residual information effectively. This approach is particularly advantageous for video compression, as it yields a smaller size than quantized features. Furthermore, we introduce a new codebook optimization technique, termed shallow codebook optimization, designed to improve the utility and efficiency of the codebook. Experimental evaluations indicate that VQ-NeRV outperforms HNeRV on video regression tasks, delivering superior reconstruction quality (a 1-2 dB increase in Peak Signal-to-Noise Ratio (PSNR)), better bits-per-pixel (bpp) efficiency, and improved video inpainting outcomes.
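To make the codebook idea concrete, here is a minimal sketch of generic vector quantization: each feature vector is replaced by the index of its nearest codebook entry, so only small integer indices (plus the shared codebook) need to be stored. This is not the VQ-NeRV Block itself, whose details the abstract does not give. The function names, the toy codebook, and the fixed-length index coding are all assumptions made for illustration.

```python
import math


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for vec in features:
        # Squared Euclidean distance from this vector to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Recover approximate features by looking up each index in the codebook."""
    return [codebook[i] for i in indices]


def index_bits(num_vectors, codebook_size):
    """Bits needed to store the indices with a fixed-length code (assumption)."""
    return num_vectors * math.ceil(math.log2(codebook_size))


# Hypothetical toy data: a 4-entry codebook of 2-D vectors and 3 feature vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 0.8]]

indices = quantize(features, codebook)
reconstructed = dequantize(indices, codebook)
bits = index_bits(len(features), len(codebook))
```

In this toy case the three vectors map to indices 0, 1, and 2 and cost 6 bits in total, whereas storing the same six scalar values at 8 bits each would cost 48 bits. The saving in a real model depends on the codebook size, the feature dimensions, and any entropy coding applied, none of which this sketch models.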
URL
https://arxiv.org/abs/2403.12401