Abstract
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making it stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
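For readers unfamiliar with the classical baseline mentioned above, the following is a minimal sketch of the standard Newton-Schulz iteration for the orthogonal polar factor, which uses only matrix-matrix multiplications. It is not the Polar Express method: the adaptive polynomial coefficients that Polar Express obtains from its minimax problem are not given in this abstract, so the sketch uses the fixed textbook update X <- 1.5 X - 0.5 X X^T X. The helper names (`matmul`, `transpose`, `newton_schulz_polar`, `orthogonality_error`), the step count, and the Frobenius-norm scaling are illustrative assumptions, and plain Python lists stand in for the GPU tensors a real implementation would use.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_polar(m, steps=30):
    """Approximate the orthogonal polar factor of m (hypothetical helper).

    Uses the classical fixed update X <- 1.5 X - 0.5 X X^T X, not the
    adaptive per-iteration polynomials of Polar Express.
    """
    # Assumption: scale by the Frobenius norm so that every singular value
    # lies in (0, 1], which is inside the iteration's convergence region.
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_p)]
             for row_x, row_p in zip(x, xxtx)]
    return x


def orthogonality_error(x):
    """Largest absolute entry of X^T X - I."""
    g = matmul(transpose(x), x)
    n = len(g)
    return max(abs(g[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


m = [[4.0, 1.0], [2.0, 3.0]]
q = newton_schulz_polar(m)
print(orthogonality_error(q))  # should be close to zero
```

The slow initial convergence noted in the abstract is visible in the update itself: a small scaled singular value s is mapped to 1.5 s - 0.5 s^3, so it grows by only about a factor of 1.5 per step until it approaches 1.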
URL
https://arxiv.org/abs/2505.16932