Paper Reading AI Learner

Scaling Learning based Policy Optimization for Temporal Tasks via Dropout

2024-03-23 12:53:51
Navid Hashemi, Bardh Hoxha, Danil Prokhorov, Georgios Fainekos, Jyotirmoy Deshmukh

Abstract

This paper introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear environment. We want the trained policy to ensure that the agent satisfies specific task objectives expressed in discrete-time Signal Temporal Logic (DT-STL). One advantage of reformulating a task in a formal framework like DT-STL is that it admits quantitative satisfaction semantics: given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We use feedback controllers, represented as feedforward neural networks. We show that this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the time horizon of the agent's task objective. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, so naive gradient-descent strategies for long-horizon task objectives suffer from the same problems. To tackle this challenge, we introduce a novel gradient-approximation algorithm based on the idea of dropout, or gradient sampling. We also show that existing smooth semantics for robustness are inefficient for gradient computation when the specification becomes complex. To address this, we propose a new smooth semantics for DT-STL that under-approximates the robustness value and scales well for backpropagation through complex specifications. We show that our control synthesis methodology helps stochastic gradient descent converge with fewer numerical issues, enabling scalable backpropagation over long time horizons and over trajectories in high-dimensional state spaces.
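The abstract does not spell out the paper's smooth semantics, but the general idea of a smooth under-approximation of DT-STL robustness can be sketched with a standard construction: the robustness of an "always" formula over a discrete-time signal is a min over per-step residuals, and a scaled log-sum-exp softmin is a differentiable lower bound on that min. The formula, the threshold, and the gain `k` below are illustrative assumptions, not the paper's actual semantics:

```python
import numpy as np

def robustness_always(signal, threshold):
    # Exact DT-STL robustness of G (x > threshold): the worst-case
    # (minimum) residual over the whole time horizon.
    return np.min(signal - threshold)

def soft_robustness_always(signal, threshold, k=10.0):
    # Smooth under-approximation of the min via scaled log-sum-exp:
    #   -(1/k) * log(sum_i exp(-k * r_i)) <= min_i r_i,
    # with a gap of at most log(n)/k that shrinks as k grows.
    r = signal - threshold
    return -np.log(np.sum(np.exp(-k * r))) / k

x = np.array([1.5, 0.8, 1.2, 0.9, 2.0])
exact = robustness_always(x, 0.5)        # min residual = 0.3
smooth = soft_robustness_always(x, 0.5)  # slightly below 0.3
```

Because the smooth value never exceeds the true robustness, driving it positive by gradient ascent certifies satisfaction of the (illustrative) specification; a conservative surrogate of this kind is what makes an under-approximating semantics attractive for training.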

Abstract (translated)

This paper presents a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear environment. We want the trained policy to ensure that the agent satisfies specific task objectives, expressed in discrete-time Signal Temporal Logic (DT-STL). An advantage of reformulating a task in a formal framework such as DT-STL is that it admits quantitative satisfaction semantics: given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We use feedback controllers and learn them with feedforward neural networks. This learning problem resembles training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the time horizon of the agent's task objective. This poses a challenge: RNNs are prone to vanishing and exploding gradients, so naive gradient-descent strategies for long-horizon task objectives suffer from the same problems. To address this, we introduce a novel gradient-approximation algorithm based on the idea of dropout, or gradient sampling. We show that existing smooth robustness semantics become inefficient for gradient computation as the specification grows complex, and we propose a new smooth semantics for DT-STL that under-approximates the robustness value and scales well for backpropagation over complex specifications. Our control synthesis methodology helps stochastic gradient descent converge with fewer numerical issues, enabling scalable backpropagation over long time horizons and over trajectories in high-dimensional state spaces.

URL

https://arxiv.org/abs/2403.15826

PDF

https://arxiv.org/pdf/2403.15826.pdf

