Abstract
Depth estimation from a single image is a challenging problem in computer vision because binocular disparity and motion information are absent. While impressive performance has recently been reported in this area by end-to-end trained deep neural architectures, it is hard to know which cues in the images these black-box systems exploit. To this end, in this work we quantify the relative contributions of the known depth cues in a monocular depth estimation setting using an indoor scene dataset. Our work uses feature extraction techniques to relate the individual features of shape, texture, colour, and saturation, taken in isolation, to predicted depth. We find that object shape, extracted by edge detection, contributes substantially more than the other features in the indoor setting considered, while the remaining features also contribute to varying degrees. These insights will help optimise depth estimation models, boosting their accuracy and robustness, and promise to broaden the practical applications of vision-based depth estimation. The project code is attached as supplementary material and will be published on GitHub.
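The abstract mentions isolating the shape cue by reducing images to edge maps before predicting depth. The paper's actual pipeline is not given here; as a minimal illustrative sketch (assuming a simple gradient-magnitude edge detector as a stand-in for whatever edge detection the authors use), one can discard colour and texture and keep only shape information like so:

```python
# Hedged sketch: isolate the "shape" cue by reducing a greyscale image to a
# binary edge map, discarding colour, saturation, and texture. A depth model
# trained on such maps sees only shape information. The thresholded gradient
# magnitude below is an illustrative assumption, not the paper's method.
import numpy as np

def edge_map(gray: np.ndarray, thresh: float = 0.2) -> np.ndarray:
    """Binary edge map from normalised gradient magnitude."""
    gy, gx = np.gradient(gray.astype(float))  # row- and column-wise gradients
    mag = np.hypot(gx, gy)                    # gradient magnitude
    if mag.max() > 0:
        mag = mag / mag.max()                 # normalise to [0, 1]
    return (mag > thresh).astype(np.uint8)

# Synthetic example: a bright square on a dark background.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0

edges = edge_map(img)
print(bool(edges.sum() > 0))   # edges appear along the square's border
print(int(edges[0, 0]))       # flat background region contains no edges
```

The resulting shape-only input can then be fed to any depth regressor in place of the RGB image, which is the kind of single-cue ablation the abstract describes.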
URL
https://arxiv.org/abs/2311.10042