Abstract
Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text embeddings: bias arising from the presence of $\textit{names}$, such as those of persons, locations, and organizations, in the text. Our study shows how $\textit{name-bias}$ in text-embedding models can lead to erroneous conclusions in the assessment of thematic similarity: text embeddings can mistakenly indicate similarity between texts based on the names they contain, even when their actual semantic content is unrelated, or indicate dissimilarity simply because of those names, even when the texts match semantically. We first demonstrate the presence of name bias in several text-embedding models and then propose $\textit{text anonymization}$ during inference, which removes references to names while preserving the core theme of the text. The efficacy of the anonymization approach is demonstrated on two downstream NLP tasks, achieving significant performance gains. Our simple, training- and optimization-free approach offers a practical and easily implementable solution to mitigate name bias.
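The anonymization step described above can be sketched minimally as follows. This is an illustrative assumption, not the paper's implementation: a real pipeline would detect person, location, and organization mentions with an NER model, whereas this sketch uses a hardcoded, hypothetical name-to-placeholder map.

```python
import re

# Hypothetical entity map; the actual method would obtain these spans
# from a named-entity recognizer rather than a fixed dictionary.
NAME_PLACEHOLDERS = {
    "Alice": "[PERSON]",
    "Acme Corp": "[ORGANIZATION]",
    "Paris": "[LOCATION]",
}

def anonymize(text: str) -> str:
    """Replace name mentions with type placeholders, preserving the core theme."""
    for name, placeholder in NAME_PLACEHOLDERS.items():
        text = re.sub(re.escape(name), placeholder, text)
    return text

print(anonymize("Alice joined Acme Corp in Paris."))
# -> [PERSON] joined [ORGANIZATION] in [LOCATION].
```

The anonymized text, rather than the raw text, would then be passed to the embedding model at inference time, so that similarity scores reflect thematic content rather than shared names.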
URL
https://arxiv.org/abs/2502.02903