Abstract
U-Nets are among the most widely used architectures in computer vision, renowned for their exceptional performance in applications such as image segmentation, denoising, and diffusion modeling. However, a theoretical explanation of the U-Net architecture design has not yet been fully established. This paper introduces a novel interpretation of the U-Net architecture by studying certain generative hierarchical models, which are tree-structured graphical models extensively utilized in both language and image domains. We demonstrate how U-Nets, with their encoder-decoder structure, long skip connections, and pooling and up-sampling layers, can naturally implement the belief propagation denoising algorithm in such generative hierarchical models, thereby efficiently approximating the denoising functions. This leads to an efficient sample complexity bound for learning the denoising function using U-Nets within these models. Additionally, we discuss the broader implications of these findings for diffusion models in generative hierarchical models. We also demonstrate that the conventional architecture of convolutional neural networks (ConvNets) is ideally suited for classification tasks within these models. This offers a unified view of the roles of ConvNets and U-Nets, highlighting the versatility of generative hierarchical models in modeling complex data distributions across language and image domains.
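To fix ideas, the belief propagation denoising algorithm on a tree-structured generative model can be sketched concretely. The toy model below is an illustrative assumption, not the paper's exact construction: a depth-2 binary tree with binary variables, where each child is drawn from its parent via a transition matrix `T` and each leaf is observed through a noisy channel `N` (playing the role of the diffusion noising process). The upward (leaf-to-root) message pass mirrors the U-Net encoder, and the downward pass mirrors the decoder, yielding the posterior over a clean leaf value given all noisy observations, i.e., the denoising function.

```python
import numpy as np

# Hypothetical toy instance: root with two internal children A, B; each
# internal node has two leaves. Leaves 0,1 hang under A; leaves 2,3 under B.
prior = np.array([0.5, 0.5])   # p(root)
T = np.array([[0.9, 0.1],      # T[i, j] = p(child = j | parent = i)
              [0.1, 0.9]])
N = np.array([[0.8, 0.2],      # N[j, o] = p(observation = o | leaf = j)
              [0.2, 0.8]])

def denoise_leaf0(obs):
    """Posterior over the clean value of leaf 0 given the four noisy
    leaf observations: one upward and one downward BP sweep."""
    # Upward pass ("encoder"): marginalize each leaf against its observation.
    leaf_msgs = [T @ N[:, o] for o in obs]        # message leaf -> parent
    m_A = T @ (leaf_msgs[0] * leaf_msgs[1])       # message A -> root
    m_B = T @ (leaf_msgs[2] * leaf_msgs[3])       # message B -> root
    # Downward pass ("decoder"): propagate root belief back toward leaf 0.
    d_A = T.T @ (prior * m_B)                     # message root -> A
    d_leaf0 = T.T @ (d_A * leaf_msgs[1])          # message A -> leaf 0
    # Combine with leaf 0's own observation likelihood and normalize.
    belief = N[:, obs[0]] * d_leaf0
    return belief / belief.sum()

# With all four observations equal to 0, the posterior for leaf 0
# concentrates on state 0.
print(denoise_leaf0([0, 0, 0, 0]))
```

Note how the upward-pass messages (`leaf_msgs`) are reused in the downward pass; this reuse of encoder activations by the decoder is precisely the role the paper attributes to the U-Net's long skip connections.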
URL
https://arxiv.org/abs/2404.18444