GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models

2024-06-18 14:44:12
Yongtao Ge, Guangkai Xu, Zhiyue Zhao, Libo Sun, Zheng Huang, Yanlong Sun, Hao Chen, Chunhua Shen

Abstract

Recent advances in discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities. While discriminative monocular geometry estimation methods rely on large-scale fine-tuning data to achieve zero-shot generalization, several generative paradigms show the potential to generalize impressively to unseen scenes by leveraging pre-trained diffusion models and fine-tuning on only a small amount of synthetic training data. Frustratingly, these models are trained with different recipes on different datasets, making it hard to identify the critical factors that determine evaluation performance. Moreover, current geometry evaluation benchmarks have two main drawbacks that may hinder the development of the field: limited scene diversity and poor label quality. To resolve these issues, (1) we build fair and strong baselines in a unified codebase for evaluating and analyzing geometry estimation models; (2) we evaluate monocular geometry estimators on more challenging geometry estimation benchmarks with diverse scenes and high-quality annotations. Our results reveal that discriminative models pre-trained on large data, such as DINOv2, can outperform generative counterparts when fine-tuned on a small amount of high-quality synthetic data under the same training configuration, which suggests that fine-tuning data quality is a more important factor than data scale and model architecture. Our observation also raises a question: if simply fine-tuning a general vision model such as DINOv2 on a small amount of synthetic depth data produces SOTA results, do we really need complex generative models for depth estimation? We believe this work can propel advancements in geometry estimation tasks as well as a wide range of downstream applications.

URL

https://arxiv.org/abs/2406.12671

PDF

https://arxiv.org/pdf/2406.12671.pdf

