Abstract
Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
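The temporal consistency loss is described only at a high level in the abstract. As a concrete illustration, below is a minimal PyTorch-style sketch of one plausible form of such a loss, matching predicted frame-to-frame depth changes against the ground-truth changes. The function name `temporal_gradient_loss`, the tensor layout, and the masking scheme are assumptions for illustration, not the paper's actual implementation.

```python
import torch

def temporal_gradient_loss(pred, gt, mask=None):
    """Hypothetical temporal consistency loss: penalize mismatch between
    predicted and ground-truth temporal depth gradients.

    pred, gt: (B, T, H, W) depth sequences over T frames.
    mask: optional (B, T, H, W) boolean validity mask.
    """
    # Temporal depth gradient: depth change between consecutive frames.
    pred_grad = pred[:, 1:] - pred[:, :-1]  # (B, T-1, H, W)
    gt_grad = gt[:, 1:] - gt[:, :-1]

    diff = (pred_grad - gt_grad).abs()
    if mask is not None:
        valid = mask[:, 1:] & mask[:, :-1]  # pixel must be valid in both frames
        return (diff * valid).sum() / valid.sum().clamp(min=1)
    return diff.mean()
```

In practice a term like this would be combined with a standard per-frame depth loss; the abstract does not specify the exact formulation or weighting.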
URL
https://arxiv.org/abs/2501.12375