Abstract
Interpretability of reinforcement learning policies is essential for many real-world tasks, but learning such interpretable policies is a hard problem. In particular, rule-based policies such as decision trees and rule lists are difficult to optimize due to their non-differentiability. While existing techniques can learn verifiable decision tree policies, there is no guarantee that the learners generate a decision tree that performs optimally. In this work, we study the optimization of size-limited decision trees for Markov Decision Processes (MDPs) and propose OMDTs: Optimal MDP Decision Trees. Given a user-defined size limit and an MDP formulation, OMDT directly maximizes the expected discounted return for the decision tree using Mixed-Integer Linear Programming. By training optimal decision tree policies for different MDPs, we empirically study the optimality gap for existing imitation learning techniques and find that they perform sub-optimally. We show that this is due to an inherent shortcoming of imitation learning: complex policies cannot be represented using size-limited trees. In such cases, it is better to directly optimize the tree for expected return. While there is generally a trade-off between the performance and interpretability of machine learning models, we find that OMDTs limited to a depth of 3 often perform close to the optimal limit.
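The objective OMDT maximizes, the expected discounted return of a decision tree policy, can be illustrated with standard policy evaluation on a toy MDP. The sketch below is not the paper's MILP formulation; the 3-state MDP, its feature values, and the depth-1 "tree" (a single threshold test) are made-up assumptions for illustration only.

```python
import numpy as np

gamma = 0.9  # discount factor

# Hypothetical 3-state, 2-action MDP: P[a][s, s'] transition probabilities,
# R[s, a] rewards. All numbers are invented for this example.
P = [np.array([[0.8, 0.2, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]]),
     np.array([[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]])]
R = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 2.0]])

# A depth-1 decision tree over a 1-D state feature: take action 1 iff feature > 1
features = np.array([0.0, 1.0, 2.0])
policy = (features > 1.0).astype(int)  # actions [0, 0, 1]

# Policy evaluation: V = R_pi + gamma * P_pi @ V  =>  solve (I - gamma P_pi) V = R_pi
P_pi = np.array([P[policy[s]][s] for s in range(3)])
R_pi = R[np.arange(3), policy]
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Expected discounted return under a uniform initial-state distribution
expected_return = V.mean()
print(expected_return)
```

Whereas this snippet evaluates one fixed tree, OMDT encodes the tree's split thresholds and leaf actions as MILP variables and lets the solver search over all size-limited trees for the one maximizing this quantity.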
URL
https://arxiv.org/abs/2301.13185