Abstract
Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
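The multi-view consistency idea mentioned above can be sketched as follows: a learned high-resolution feature map should, after jittering and downsampling, reproduce the low-resolution features the backbone yields for each jittered view. This is a minimal illustrative sketch only; the function names, the use of `np.roll` as the jitter, and plain average pooling as the downsampler are assumptions for clarity (FeatUp itself learns its downsampler and uses richer transforms).

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool an (H, W, C) feature map by an integer factor k."""
    H, W, C = x.shape
    return x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

def multiview_consistency_loss(hi_res, lo_res_views, shifts, k):
    """Compare each jittered, downsampled version of the candidate
    high-res feature map against the observed low-res features.

    hi_res:       (H, W, C) candidate high-resolution features
    lo_res_views: list of (H//k, W//k, C) low-res features per view
    shifts:       list of (dy, dx) jitters that produced each view
    """
    loss = 0.0
    for (dy, dx), lo in zip(shifts, lo_res_views):
        # Apply the same spatial jitter to the high-res map,
        # then pool it down to the backbone's resolution.
        view = np.roll(hi_res, shift=(dy, dx), axis=(0, 1))
        loss += float(((avg_pool(view, k) - lo) ** 2).mean())
    return loss

# Usage: a perfectly consistent hi-res map gives zero loss.
hi = np.random.rand(8, 8, 3)
shifts = [(0, 0), (1, 0), (0, 1)]
lo_views = [avg_pool(np.roll(hi, s, axis=(0, 1)), 2) for s in shifts]
print(multiview_consistency_loss(hi, lo_views, shifts, 2))  # ~0.0
```

In the paper's framing this plays the role NeRF's photometric loss plays for radiance fields: many low-resolution "observations" jointly constrain a single higher-resolution representation.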
URL
https://arxiv.org/abs/2403.10516