Abstract
Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Test-time scaling based on Warm-K enhances behavioral consistency and reactivity without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
URL
https://arxiv.org/abs/2512.13262