Abstract
Building on the remarkable achievements of generative sampling for natural images, we propose an ambitious challenge: generating samples of entire multivariate time series that resemble images. The statistical difficulty lies in the small sample size, sometimes only a few hundred subjects. This is especially problematic for deep generative models that follow the conventional approach of sampling from a canonical distribution and then decoding or denoising the samples to match the true data distribution. In contrast, our method is grounded in information theory and aims to implicitly characterize the distribution of images, in particular the (global and local) dependency structure between pixels. We do so by empirically estimating the KL divergence between the data distribution and the corresponding marginal distribution in its dual form, which enables generative sampling directly in the optimized one-dimensional dual-divergence space. Specifically, in the dual space, training samples representing the data distribution are embedded as clusters between two end points; in theory, any sample embedded between those end points is in-distribution with respect to the data distribution. Our key idea for generating novel images is to interpolate between the clusters via a walk following the gradients of the dual function with respect to the data dimensions. Beyond the data efficiency gained from direct sampling, we propose an algorithm that substantially reduces the sample complexity of estimating the divergence of the data distribution from its marginal. We provide strong theoretical guarantees along with an extensive empirical evaluation on many real-world datasets from diverse domains, establishing the superiority of our approach over state-of-the-art deep learning methods.
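The dual-form KL estimation the abstract refers to is typically the Donsker–Varadhan representation, D_KL(P||Q) = sup_T E_P[T] − log E_Q[e^T], where the supremum runs over critic functions T. As a minimal sketch (not the paper's algorithm), the toy example below fits a linear critic by gradient ascent on the dual objective for two one-dimensional Gaussians; the Gaussian distributions, the linear critic family, and the step size are illustrative choices, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
xp = rng.normal(1.0, 1.0, 20000)  # samples from P = N(1, 1)
xq = rng.normal(0.0, 1.0, 20000)  # samples from Q = N(0, 1)

# Linear critic T(x) = a * x; the DV objective is shift-invariant,
# so a bias term would not change the estimate.
a = 0.0
for _ in range(500):
    w = np.exp(a * xq)
    # gradient of E_P[T] - log E_Q[exp T] with respect to a
    grad = xp.mean() - (xq * w).sum() / w.sum()
    a += 0.1 * grad

# Dual-divergence estimate; the true KL(N(1,1) || N(0,1)) is 0.5,
# and the log density ratio is linear here, so a linear critic suffices.
kl_est = (a * xp).mean() - np.log(np.exp(a * xq).mean())
```

For equal-variance Gaussians the optimal critic has slope 1, so the estimate converges near the true value 0.5; richer critic families (e.g., neural networks) are needed when the log density ratio is not linear.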
URL
https://arxiv.org/abs/2404.07377