Abstract
In computational linguistics, a large body of work exists on distributed modeling of lexical relations, focussing largely on relations such as hypernymy (scientist -- person) that hold between two categories, as expressed by common nouns. In contrast, computational linguistics has paid little attention to entities denoted by proper nouns (Marie Curie, Mumbai, ...). These have been investigated in detail by the Knowledge Representation and Semantic Web communities, but generally not with regard to their linguistic properties. Our paper closes this gap by investigating and modeling the lexical relation of instantiation, which holds between an entity-denoting and a category-denoting expression (Marie Curie -- scientist or Mumbai -- city). We present a new, principled dataset for the task of instantiation detection, as well as experiments and analyses on this dataset. We obtain the following results: (a) entities belonging to one category form a region in distributional space, but the embedding for the category word is typically located outside this region; (b) it is easy to learn to distinguish entities from categories on the basis of distributional evidence, but due to (a), instantiation proper is much harder to learn when common nouns are used as representations of categories; (c) this problem can be alleviated by using category representations based on entity embeddings rather than category word embeddings.
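As a rough illustration of result (c), the sketch below builds a category representation from the embeddings of known member entities and compares it with the embedding of the category word. The abstract does not say how the entity-based representation is computed, so the centroid used here is an assumption, as are the helper names `centroid` and `cosine` and all vector values, which are invented toy numbers rather than data from the paper.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional embeddings, invented for illustration only.
# They are constructed so that the category word lies away from its
# member entities, mirroring result (a) by design, not as evidence.
scientists = {
    "Marie Curie": [0.9, 0.1, 0.0],
    "Albert Einstein": [0.8, 0.2, 0.1],
}
category_word = [0.1, 0.9, 0.3]  # hypothetical embedding of "scientist"
candidate = [0.7, 0.3, 0.1]      # hypothetical embedding of an unseen entity

# Assumed entity-based category representation: the mean of member entities.
entity_based = centroid(list(scientists.values()))

sim_entity_based = cosine(candidate, entity_based)
sim_word_based = cosine(candidate, category_word)

print(f"similarity to entity-based representation: {sim_entity_based:.3f}")
print(f"similarity to category word embedding:     {sim_word_based:.3f}")
```

With these toy values the unseen entity is closer to the entity-based representation than to the category word embedding. That outcome follows from how the vectors were chosen, so it only shows the mechanics of the comparison and says nothing about the paper's actual results.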
URL
https://arxiv.org/abs/1808.01662