Abstract
When deploying deep neural networks on robots or other physical systems, the learned model should reliably quantify predictive uncertainty. A reliable uncertainty estimate allows downstream modules to reason about the safety of the system's actions. In this work, we address metrics for evaluating such uncertainty estimates. Specifically, we focus on regression tasks and investigate the Area Under the Sparsification Error curve (AUSE), Calibration Error, Spearman's Rank Correlation, and Negative Log-Likelihood (NLL). Using synthetic regression datasets, we examine how these metrics behave under four typical types of uncertainty and how stable they are with respect to the size of the test set, and we reveal their strengths and weaknesses. Our results indicate that Calibration Error is the most stable and interpretable metric, but AUSE and NLL also have their respective use cases. We discourage the use of Spearman's Rank Correlation for evaluating uncertainties and recommend replacing it with AUSE.
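The sketch below illustrates, on a toy heteroscedastic regression problem, how the four metrics named in the abstract can be computed for Gaussian predictive distributions. It is only a minimal illustration under common definitions of these metrics; the paper's exact formulations, normalisations, and the synthetic datasets it uses may differ, and the data-generating process and the helper `sparsification_curve` here are purely hypothetical.

```python
# Minimal sketch (not the paper's reference implementation) of NLL, Spearman's
# rank correlation, AUSE, and a regression calibration error, assuming Gaussian
# predictive distributions parameterised by (mean, std).
import numpy as np
from scipy.stats import norm, spearmanr

rng = np.random.default_rng(0)

# Toy heteroscedastic data: the model predicts the true mean and a noisy
# estimate of the input-dependent noise level (an imperfect uncertainty).
n = 2000
x = rng.uniform(-3, 3, n)
true_std = 0.2 + 0.3 * np.abs(x)
y = np.sin(x) + rng.normal(0, true_std)
pred_mean = np.sin(x)
pred_std = true_std * np.exp(rng.normal(0, 0.1, n))

err = np.abs(y - pred_mean)

# Negative log-likelihood under the Gaussian predictive distribution.
nll = -norm.logpdf(y, loc=pred_mean, scale=pred_std).mean()

# Spearman's rank correlation between predicted uncertainty and absolute error.
rho, _ = spearmanr(pred_std, err)

# AUSE: remove samples in order of decreasing predicted uncertainty, track the
# mean error of the remaining samples, and compare against the oracle ordering
# (decreasing true error); AUSE is the area between the two curves.
def sparsification_curve(order):
    ranked = err[order]
    suffix_means = ranked[::-1].cumsum()[::-1] / np.arange(n, 0, -1)
    return suffix_means / err.mean()  # normalise so the curve starts at 1

curve_unc = sparsification_curve(np.argsort(-pred_std))
curve_oracle = sparsification_curve(np.argsort(-err))
ause = (curve_unc - curve_oracle).mean()

# Calibration error for regression: mean absolute gap between the nominal
# coverage of central prediction intervals and their empirical coverage.
levels = np.linspace(0.05, 0.95, 19)
z = norm.ppf(0.5 + levels / 2)  # interval half-width in units of pred_std
empirical = (err[None, :] <= z[:, None] * pred_std[None, :]).mean(axis=1)
ece = np.abs(empirical - levels).mean()

print(f"NLL={nll:.3f}  Spearman={rho:.3f}  AUSE={ause:.4f}  CalErr={ece:.4f}")
```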
URL
https://arxiv.org/abs/2405.04278