Contextual Compositionality Detection with External Knowledge Bases and Word Embeddings

Abstract

When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multi-word phrases is critical in any application of semantic processing, such as search engines; failing to detect non-compositional phrases can notably hurt system effectiveness. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase 'green card' is compositional when referring to a green-colored card, whereas it is non-compositional when it means permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase, and allows its compositionality to be detected in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model notably outperforms state-of-the-art baselines on detecting phrase compositionality.
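
To make the idea of context-dependent compositionality concrete, the following is a minimal sketch and not the model described in the abstract. It assumes a common simplification: the phrase is represented by the average of its word vectors, the usage scenario by the average of its context-word vectors, and the cosine similarity between the two serves as a rough compositionality score. The 3-dimensional embeddings, the names `TOY_EMBEDDINGS`, `mean_vector`, `cosine`, and `compositionality_score`, and the example context words are all hypothetical; no knowledge-base features are included.

```python
import math

# Toy 3-dimensional embeddings, invented purely for illustration.
# Real systems would load pretrained vectors instead.
TOY_EMBEDDINGS = {
    "green": [1.0, 0.0, 0.0],
    "card": [0.0, 1.0, 0.0],
    "colored": [0.9, 0.1, 0.0],
    "paper": [0.1, 0.9, 0.0],
    "visa": [0.0, 0.1, 1.0],
    "residence": [0.0, 0.0, 1.0],
    "immigration": [0.0, 0.0, 0.9],
}


def mean_vector(words, embeddings):
    """Average the vectors of all words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("none of the given words has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compositionality_score(phrase_words, context_words, embeddings):
    """Similarity between the composed phrase vector and its usage context.

    A higher score suggests the phrase is used compositionally in this
    context; a lower score suggests a non-compositional use.
    """
    composed = mean_vector(phrase_words, embeddings)
    context = mean_vector(context_words, embeddings)
    return cosine(composed, context)


phrase = ["green", "card"]
literal_score = compositionality_score(
    phrase, ["colored", "paper"], TOY_EMBEDDINGS
)
idiomatic_score = compositionality_score(
    phrase, ["visa", "residence", "immigration"], TOY_EMBEDDINGS
)
print("literal context:", round(literal_score, 3))
print("idiomatic context:", round(idiomatic_score, 3))
```

With these toy vectors, the same phrase scores higher in the literal context than in the immigration context, which mirrors the 'green card' example: the score is a property of the phrase together with its context, not of the phrase alone.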

URL

https://arxiv.org/abs/1903.08389

PDF

https://arxiv.org/pdf/1903.08389.pdf