Abstract
Inductive biases are crucial in disentangled representation learning for narrowing down an underspecified solution set. In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature: data compression into a grid-like latent space via quantization, collective independence amongst latents, and minimal functional influence of any latent on how other latents determine data generation. In principle, these inductive biases are deeply complementary: they most directly specify properties of the latent space, encoder, and decoder, respectively. In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits. To address this, we propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives. The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks. We also verify that Tripod significantly improves upon its naive incarnation and that all three of its "legs" are necessary for best performance.
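The first inductive bias, compressing data into a grid-like latent space via quantization, can be illustrated with a minimal scalar-quantization sketch: each latent dimension is bounded and then rounded onto a small set of evenly spaced values, so the joint latent space becomes a finite grid. This is a generic illustration of the idea (the function name, grid resolution, and use of `tanh` bounding are assumptions for the sketch, not Tripod's exact formulation).

```python
import numpy as np

def quantize_latents(z, levels=5):
    """Illustrative scalar quantization of latents onto a grid.

    Each latent dimension is squashed into (-1, 1) with tanh, then
    rounded to one of `levels` evenly spaced values, so the latent
    space becomes a finite grid. A sketch of the general technique,
    not the paper's exact method.
    """
    z_bounded = np.tanh(z)                        # squash to (-1, 1)
    step = 2.0 / (levels - 1)                     # grid spacing
    z_q = np.round((z_bounded + 1.0) / step) * step - 1.0
    return z_q

z = np.array([0.2, -1.7, 3.0])
print(quantize_latents(z))  # every entry lands on {-1, -0.5, 0, 0.5, 1}
```

In practice such rounding is non-differentiable, so implementations typically train through it with a straight-through gradient estimator; the sketch above shows only the forward quantization.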
URL
https://arxiv.org/abs/2404.10282