Abstract
We introduce AlphaGrad, a memory-efficient, conditionally stateless optimizer addressing the memory overhead and hyperparameter complexity of adaptive methods like Adam. AlphaGrad enforces scale invariance via tensor-wise L2 gradient normalization followed by a smooth hyperbolic tangent transformation, g′ = tanh(α · g̃), controlled by a single steepness parameter α. Our contributions include: (1) the AlphaGrad algorithm formulation; (2) a formal non-convex convergence analysis guaranteeing stationarity; (3) extensive empirical evaluation on diverse RL benchmarks (DQN, TD3, PPO). Compared to Adam, AlphaGrad demonstrates a highly context-dependent performance profile. While exhibiting instability in off-policy DQN, it provides enhanced training stability with competitive results in TD3 (requiring careful α tuning) and achieves substantially superior performance in on-policy PPO. These results underscore the critical importance of empirical α selection, revealing strong interactions between the optimizer's dynamics and the underlying RL algorithm. AlphaGrad presents a compelling alternative optimizer for memory-constrained scenarios and shows significant promise for on-policy learning regimes where its stability and efficiency advantages can be particularly impactful.
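The abstract describes the update as tensor-wise L2 normalization followed by a tanh transform with steepness α. The sketch below illustrates that stateless update in a PyTorch-style optimizer; the class name, default hyperparameter values, and the eps term for numerical safety are illustrative assumptions, not details taken from the paper.

```python
import torch
from torch.optim import Optimizer


class AlphaGradSketch(Optimizer):
    """Minimal, stateless sketch of the update described in the abstract:
    per-tensor L2 normalization followed by tanh(alpha * g_tilde).
    Defaults (lr, alpha, eps) are illustrative, not the paper's values."""

    def __init__(self, params, lr=1e-3, alpha=1.0, eps=1e-12):
        defaults = dict(lr=lr, alpha=alpha, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            lr, alpha, eps = group["lr"], group["alpha"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                # Tensor-wise L2 normalization for scale invariance.
                g_tilde = g / (g.norm(p=2) + eps)
                # Smooth, bounded transform controlled by steepness alpha.
                p.add_(torch.tanh(alpha * g_tilde), alpha=-lr)
        return loss
```

Because no per-parameter moment estimates are stored, the optimizer's memory footprint is essentially that of plain SGD, which is the memory advantage the abstract highlights.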
URL
https://arxiv.org/abs/2504.16020