Paper Reading AI Learner

When and How Unlabeled Data Provably Improve In-Context Learning

2025-06-18 10:01:17
Yingcong Li, Xiangyu Chang, Muti Kara, Xiaofeng Liu, Amit Roy-Chowdhury, Samet Oymak

Abstract

Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) the loss landscape of one-layer linear attention models recovers the optimal fully-supervised estimator but completely fails to exploit unlabeled data; (2) in contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^i X^\top y$, with $X$ and $y$ denoting features and partially-observed labels (with missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so a mild amount of depth/looping suffices. As an application of the theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves semi-supervised tabular learning performance over standard single-pass inference.
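To make the estimator family concrete, here is a minimal NumPy sketch (not the paper's code) that evaluates $\sum_{i\ge 0} a_i (X^\top X)^i X^\top y$ on a synthetic binary GMM with missing labels set to zero. The dimensions, labeled fraction, and coefficients `a_i` below are illustrative assumptions; in the paper, the coefficients emerge from trained attention weights rather than being hand-picked.

```python
# Minimal sketch (assumed setup, not the paper's code) of the estimator
# family sum_i a_i (X^T X)^i X^T y that multilayer/looped attention is
# shown to implicitly construct.
import numpy as np

rng = np.random.default_rng(0)
d, n, frac_labeled = 8, 200, 0.3

# Binary GMM: x = label * mu + standard Gaussian noise, label in {-1, +1}.
mu = rng.normal(size=d) / np.sqrt(d)
labels = rng.choice([-1.0, 1.0], size=n)
X = labels[:, None] * mu + rng.normal(size=(n, d))

# Partially observed labels: missing entries are set to zero,
# matching the estimator's definition.
observed = rng.random(n) < frac_labeled
y = np.where(observed, labels, 0.0)

def poly_estimator(X, y, coeffs):
    """Return sum_i coeffs[i] * (X^T X)^i X^T y by iterative accumulation."""
    v = X.T @ y
    G = X.T @ X
    beta, term = np.zeros_like(v), v
    for a in coeffs:
        beta += a * term
        term = G @ term        # raises the power of (X^T X) by one
    return beta

# The i = 0 term alone corresponds to what a single linear-attention layer
# can express; the higher powers are what extra depth/looping contributes.
# The coefficient values here are arbitrary, chosen only for illustration.
beta = poly_estimator(X, y, coeffs=[1.0, -1e-2, 1e-4])
print("accuracy vs. true labels:", np.mean(np.sign(X @ beta) == labels))
```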
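The Expectation Maximization connection can also be made explicit. For a symmetric unit-variance binary GMM, the E-step posterior mean of a label is $\tanh(\hat{\mu}^\top x)$ and the M-step re-estimates the mean from the current soft labels. The sketch below (reusing `X`, `y`, `observed`, and `labels` from the previous snippet; the iteration count is an assumption) runs this iterative pseudo-labeling loop, which the paper relates to what looped transformers compute.

```python
def em_pseudo_label(X, y, observed, n_iters=10):
    """EM for the symmetric binary GMM, viewed as iterative pseudo-labeling."""
    y_soft = y.copy()                            # missing labels start at zero
    for _ in range(n_iters):
        mu_hat = X.T @ y_soft / len(y_soft)      # M-step: mean from soft labels
        pseudo = np.tanh(X @ mu_hat)             # E-step: E[label | x] under the GMM
        y_soft = np.where(observed, y, pseudo)   # observed labels stay fixed
    return y_soft

y_refined = em_pseudo_label(X, y, observed)
print("pseudo-label accuracy on unlabeled points:",
      np.mean(np.sign(y_refined[~observed]) == labels[~observed]))
```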
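For the proposed application, looping a tabular model amounts to self-training-style repeated inference: predict pseudo-labels for the unlabeled rows, feed them back as additional supervision, and run another pass. The sketch below is a hedged approximation in which a scikit-learn classifier stands in for a tabular foundation model; the actual method conditions an in-context model on pseudo-labeled demonstrations rather than refitting a classifier, and the loop count is an assumption.

```python
from sklearn.linear_model import LogisticRegression

def looped_inference(X_lab, y_lab, X_unlab, n_loops=3):
    # Pass 0: fit on the labeled rows only (the standard single pass).
    model = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(n_loops):
        pseudo = model.predict(X_unlab)                 # pseudo-label unlabeled rows
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo])
        model = LogisticRegression().fit(X_all, y_all)  # re-run with pseudo-labels
    return model

# Reusing the synthetic GMM data from the first snippet:
model = looped_inference(X[observed], y[observed], X[~observed])
print("looped accuracy on unlabeled points:",
      np.mean(model.predict(X[~observed]) == labels[~observed]))
```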


URL

https://arxiv.org/abs/2506.15329

PDF

https://arxiv.org/pdf/2506.15329.pdf

