Abstract
This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learn good semantic abstractions, they are not trained for explicit spatial awareness. Our approach, named the Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE enables various zero-shot spatial-understanding capabilities (e.g., figure-ground segmentation, image layering, and edge detection) while preserving the high-level semantic quality of MAE's self-supervised representations. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstruction. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at this https URL
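
As a rough illustration of the pipeline the abstract describes, below is a minimal PyTorch sketch, not the authors' implementation: a masked encoder over visible patches, a decoder whose learned queries are mapped to per-Gaussian parameters (center, depth, scale, color, opacity), and a heavily simplified depth-ordered 2D splatting step that renders the reconstruction for a pixel-space loss. The class name TinyGMAE, all layer sizes, and the simplified compositing are assumptions made for illustration only.

import torch
import torch.nn as nn

class TinyGMAE(nn.Module):
    """Hypothetical toy model: MAE-style masked encoder -> Gaussian parameters -> splatted image."""

    def __init__(self, patch=16, img=64, dim=128, n_gauss=64):
        super().__init__()
        self.img, self.n_gauss = img, n_gauss
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        # Learned queries; the decoder maps each one to 9 Gaussian parameters:
        # 2D center, depth, 2D scale, RGB color, opacity.
        self.queries = nn.Parameter(torch.randn(n_gauss, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.to_gauss = nn.Linear(dim, 9)

    def splat(self, params):
        # params: (B, N, 9) -> image (B, 3, H, W) via front-to-back alpha compositing.
        B, N, _ = params.shape
        xy = params[..., 0:2].sigmoid()                  # centers in [0, 1]^2
        depth = params[..., 2]                           # used only for ordering
        scale = params[..., 3:5].sigmoid() * 0.2 + 1e-3
        rgb = params[..., 5:8].sigmoid()
        alpha = params[..., 8].sigmoid()
        ys, xs = torch.meshgrid(torch.linspace(0, 1, self.img),
                                torch.linspace(0, 1, self.img), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).to(params)  # (H, W, 2) pixel coordinates
        order = depth.argsort(dim=1)                     # nearest Gaussians first
        canvas = torch.zeros(B, 3, self.img, self.img, device=params.device)
        trans = torch.ones(B, 1, self.img, self.img, device=params.device)
        b = torch.arange(B, device=params.device)
        for i in range(N):                               # composite one Gaussian at a time
            idx = order[:, i]
            d2 = ((grid[None] - xy[b, idx][:, None, None]) ** 2
                  / scale[b, idx][:, None, None] ** 2).sum(-1)
            g = (alpha[b, idx][:, None, None] * torch.exp(-0.5 * d2))[:, None]  # (B, 1, H, W)
            canvas = canvas + trans * g * rgb[b, idx][:, :, None, None]
            trans = trans * (1 - g)
        return canvas

    def forward(self, imgs, mask_ratio=0.75):
        tokens = self.patchify(imgs).flatten(2).transpose(1, 2)      # (B, num_patches, dim)
        keep = int(tokens.shape[1] * (1 - mask_ratio))
        visible = tokens[:, torch.randperm(tokens.shape[1])[:keep]]  # random masking, as in MAE
        enc = self.encoder(visible)
        q = self.queries.unsqueeze(0).expand(imgs.shape[0], -1, -1)
        gauss = self.to_gauss(self.decoder(q, enc))
        recon = self.splat(gauss)
        return ((recon - imgs) ** 2).mean(), recon                   # pixel-space reconstruction loss

model = TinyGMAE()
loss, recon = model(torch.rand(2, 3, 64, 64))

Because the model must place, scale, and order the Gaussians to reconstruct pixels, byproducts such as the per-Gaussian depths and opacities are what a GMAE-style model could reuse for the zero-shot layering and figure-ground behaviors mentioned above.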
URL
https://arxiv.org/abs/2501.03229