Zipfian Whitening

2024-11-01 15:40:19
Sho Yokoi, Han Bao, Hiroto Kurita, Hidetoshi Shimodaira

Abstract

The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective, and in terms of the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.
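The core operation the abstract describes — PCA whitening with the mean and covariance taken under the empirical (Zipfian) word-frequency distribution rather than the uniform distribution over vocabulary entries — is easy to sketch. Below is a minimal NumPy sketch under that reading of the abstract; the function name `zipfian_whitening`, the eigenvalue clipping, and the toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def zipfian_whitening(W, freq, eps=1e-12):
    """Frequency-weighted PCA whitening (a sketch, not the authors' code).

    W    : (n_words, dim) embedding matrix
    freq : (n_words,) empirical word frequencies (Zipfian in practice)

    Mean and covariance are computed under the unigram distribution p
    instead of the uniform weights 1/n assumed by standard whitening.
    """
    p = freq / freq.sum()                      # empirical unigram distribution
    mu = p @ W                                 # frequency-weighted mean
    Wc = W - mu                                # center under p
    cov = (Wc * p[:, None]).T @ Wc             # frequency-weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # cov is symmetric PSD
    eigvals = np.clip(eigvals, eps, None)      # guard against zero eigenvalues
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T  # cov^{-1/2}
    return Wc @ inv_sqrt                       # whitened embeddings

# Toy usage with Zipf-distributed counts (illustrative only)
rng = np.random.default_rng(0)
W = rng.normal(size=(5000, 64))               # stand-in embedding matrix
counts = rng.zipf(a=1.5, size=5000).astype(float)
W_white = zipfian_whitening(W, counts)
```

Note that relative to standard PCA whitening, the only change is replacing the uniform weights 1/n with the frequencies p_i when taking expectations, which is why the abstract can describe the method as "simply performing PCA whitening weighted by the empirical word frequency."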

URL

https://arxiv.org/abs/2411.00680

PDF

https://arxiv.org/pdf/2411.00680.pdf

