Paper Reading AI Learner

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

2024-04-24 09:30:00
Jacob Pfau, William Merrill, Samuel R. Bowman

Abstract

Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
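To make the contrast described above concrete, here is a minimal Python sketch of how matched prompts might be built: one whose intermediate span is an informative chain of thought, and one where that span is replaced by meaningless dot filler of the same length. The toy subset-sum question, the make_filler and build_prompt helpers, and the prompt format are hypothetical illustrations for this summary, not the paper's actual tasks or data pipeline.

    # Illustrative sketch (not the authors' code): build two prompts for the same question,
    # differing only in whether the intermediate tokens carry information or are filler dots.

    def make_filler(num_tokens: int) -> str:
        """Return a 'reasoning' span made of meaningless filler tokens ('.' repeated)."""
        return " ".join(["."] * num_tokens)

    def build_prompt(question: str, reasoning_span: str, answer_prefix: str = "Answer:") -> str:
        """Assemble question + intermediate tokens + answer cue into one input sequence."""
        return f"{question}\n{reasoning_span}\n{answer_prefix}"

    question = "Is there a subset of (3, 5, 2) that sums to 10?"   # hypothetical toy task
    chain_of_thought = "3 + 5 = 8 ; 8 + 2 = 10 ; yes"              # informative intermediate tokens
    filler = make_filler(len(chain_of_thought.split()))            # same length, no information

    print(build_prompt(question, chain_of_thought))  # chain-of-thought condition
    print(build_prompt(question, filler))            # filler-token condition studied in the paper

The paper's empirical question is whether a transformer trained on the second kind of prompt can still exploit the extra token positions as computation, even though the tokens themselves say nothing about the intermediate steps.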

Abstract (translated)

Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply to the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful, in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.

URL

https://arxiv.org/abs/2404.15758

PDF

https://arxiv.org/pdf/2404.15758.pdf

