Paper Reading AI Learner

Resa: Transparent Reasoning Models via SAEs

2025-06-11 17:44:01
Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger

Abstract

How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.
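The abstract describes a two-stage recipe: fit an SAE on a source model's activations, then let the frozen SAE guide supervised fine-tuning of a target model. As a rough illustration only, here is a minimal PyTorch sketch of both stages under stated assumptions: the L1-sparsity formulation, the loss weighting, and all names (SparseAutoencoder, sae_guided_sft_loss, alpha) are illustrative guesses, not the paper's actual implementation.

```python
# Hypothetical sketch of the two SAE-Tuning stages from the abstract:
# (1) fit a sparse autoencoder (SAE) on hidden states of a source model,
# (2) use the frozen SAE as an auxiliary signal while supervised
# fine-tuning a target model. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on its codes."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, h: torch.Tensor):
        z = F.relu(self.encoder(h))   # sparse feature codes
        h_hat = self.decoder(z)       # reconstruction of the activation
        return h_hat, z

def sae_loss(sae: SparseAutoencoder, h: torch.Tensor, l1_coeff: float = 1e-3):
    """Stage 1: reconstruction + sparsity, fit on source-model activations."""
    h_hat, z = sae(h)
    return F.mse_loss(h_hat, h) + l1_coeff * z.abs().mean()

def sae_guided_sft_loss(logits, labels, sae, target_hidden, alpha=0.1):
    """Stage 2: standard next-token SFT loss plus a term that keeps the
    target model's hidden states reconstructible by the frozen source SAE
    (freeze beforehand with sae.requires_grad_(False))."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    h_hat, _ = sae(target_hidden)
    guide = F.mse_loss(h_hat, target_hidden)
    return ce + alpha * guide
```

Note the training data constraint from the abstract: the SFT corpus is verified question-answer pairs with no reasoning traces, so the cross-entropy term above would be computed on answers only, with the SAE term carrying the "reasoning" signal.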

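The modularity claim (an SAE extracted from Qwen or Qwen-Math attached to R1-Distill at test time, with no retraining) implies some intervention on the forward pass. One plausible reading, sketched below with a standard PyTorch forward hook, routes a chosen layer's hidden states through the frozen SAE; the layer index, blending scheme, and attach_sae helper are hypothetical, not taken from the paper.

```python
# Hypothetical test-time attachment of a frozen SAE to another model via a
# PyTorch forward hook. Returning a value from the hook replaces the layer's
# output. Layer choice and blending weight are illustrative assumptions.
def attach_sae(layer_module, sae, blend: float = 1.0):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h_hat, _ = sae(h)                         # frozen SAE reconstruction
        h_new = (1 - blend) * h + blend * h_hat   # blend=1.0 -> full routing
        if isinstance(output, tuple):
            return (h_new,) + output[1:]
        return h_new
    return layer_module.register_forward_hook(hook)

# Usage (hypothetical Hugging Face layer path):
# handle = attach_sae(model.model.layers[12], sae)
# ...generate as usual...
# handle.remove()  # detach the SAE to restore the base model
```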

URL

https://arxiv.org/abs/2506.09967

PDF

https://arxiv.org/pdf/2506.09967.pdf

