Abstract
The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either a uniform or a Zipfian base measure. By adopting the latter, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident both from the information-geometric perspective and from the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.
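The core operation described above, PCA whitening weighted by the empirical word frequency, can be sketched in a few lines. The following is a minimal NumPy sketch under my own assumptions, not the authors' implementation: the function name `zipfian_whitening`, the epsilon guard, and the variable names `W` and `p` are illustrative, standing for an arbitrary embedding matrix and an empirical unigram distribution estimated from corpus counts.

```python
import numpy as np

def zipfian_whitening(W: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Frequency-weighted (Zipfian) PCA whitening sketch.

    W : (V, d) matrix of word embeddings.
    p : (V,) empirical word frequencies (unigram probabilities), summing to 1.
    Returns whitened embeddings of the same shape.
    """
    # Center with the frequency-weighted mean rather than the uniform mean.
    mu = p @ W                                  # (d,)
    Wc = W - mu
    # Frequency-weighted covariance: sum_w p(w) * x_w x_w^T.
    cov = (Wc * p[:, None]).T @ Wc              # (d, d)
    # PCA whitening: rotate onto the eigenbasis, rescale by 1/sqrt(eigenvalue).
    eigvals, eigvecs = np.linalg.eigh(cov)
    eps = 1e-12                                 # numerical guard for near-zero eigenvalues
    transform = eigvecs / np.sqrt(eigvals + eps)
    return Wc @ transform

# Hypothetical usage: W from any embedding model, p from corpus word counts.
# counts = np.array([...]); p = counts / counts.sum()
# W_white = zipfian_whitening(W, p)
```

Under the frequency-weighted expectation, the whitened embeddings have zero mean and identity covariance; the uniform-weight special case (p constant) recovers ordinary PCA whitening.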
URL
https://arxiv.org/abs/2411.00680