Abstract
Accurately simulating the diverse behaviors of heterogeneous agents in various scenarios is fundamental to autonomous driving simulation. This task is challenging due to the multi-modality of the behavior distribution, the high dimensionality of driving scenarios, distribution shift, and incomplete information. Our first insight is to leverage state matching through differentiable simulation to provide meaningful learning signals and achieve efficient credit assignment for the policy. This is demonstrated by revealing the existence of gradient highways and inter-agent gradient pathways. However, the issues of gradient explosion and weak supervision in low-density regions are discovered. Our second insight is that these issues can be addressed by applying dual policy regularizations to narrow the function space. Further considering diversity, our third insight is that the behaviors of heterogeneous agents in the dataset can be effectively compressed into a series of prototype vectors for retrieval. These insights lead to our model-based reinforcement-imitation learning framework with temporally abstracted mixture-of-codebooks (MRIC). MRIC introduces an open-loop model-based imitation learning regularization to stabilize training, and a model-based reinforcement learning (RL) regularization to inject domain knowledge. The RL regularization involves a differentiable Minkowski-difference-based collision avoidance reward and projection-based on-road and traffic-rule compliance rewards. A dynamic multiplier mechanism is further proposed to eliminate interference from the regularizations while ensuring their effectiveness. Experimental results on the large-scale Waymo Open Motion Dataset show that MRIC outperforms state-of-the-art baselines on diversity, behavioral realism, and distributional realism, with large margins on some key metrics (e.g., collision rate, minSADE, and time-to-collision JSD).
URL
https://arxiv.org/abs/2404.18464