The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

2025-05-22 17:23:14
Noah Amsel, David Persson, Christopher Musco, Robert Gower

Abstract

Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making it stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
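
For reference, the classical Newton-Schulz baseline named in the abstract can be sketched in a few lines of NumPy. This is only the textbook cubic iteration, not Polar Express itself: the abstract does not give the adapted per-iteration polynomials, so none are reproduced here. The function name `newton_schulz_polar` and the `steps` parameter are illustrative assumptions.

```python
import numpy as np


def newton_schulz_polar(a, steps=25):
    """Approximate the polar factor U V^T of a = U S V^T (classical Newton-Schulz).

    Assumes `a` is a real matrix with full column rank.
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    x = a / np.linalg.norm(a)
    for _ in range(steps):
        # Applies the odd polynomial p(s) = 1.5 s - 0.5 s^3 to each singular
        # value using matrix-matrix products only (no QR, no inverses).
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((6, 4))
    q = newton_schulz_polar(a)
    # The columns of q should be close to orthonormal.
    print(np.linalg.norm(q.T @ q - np.eye(4)))
```

For a small singular value s, p(s) is approximately 1.5 s, so each early step grows it by only about 50%. This is the slow initial convergence the abstract refers to.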

URL

https://arxiv.org/abs/2505.16932

PDF

https://arxiv.org/pdf/2505.16932.pdf

