Abstract
A universal cross-lingual representation of documents is important for many natural language processing tasks. In this paper, we present a document vectorization method that creates document vectors effectively via a self-attention mechanism within a neural machine translation (NMT) framework. The underlying model can be trained with parallel corpora that are unrelated to the task at hand. At test time, our method takes a monolingual document and converts it into a "Neural machine Translation framework based crosslingual Document Vector with distance constraint training" (cNTDV). cNTDV is a follow-up to our previous research on the NMT framework based document vector. It produces the document vector quickly, in a single forward pass of the encoder. Moreover, it is trained with a distance constraint, so that the document vectors obtained from different language pairs are consistent with each other. In a cross-lingual document classification task, our cNTDV embeddings surpass the published state-of-the-art performance on the English-to-German classification test and, to the best of our knowledge, achieve the second-best performance on the German-to-English classification test. Unlike our previous method, cNTDV does not need a translator at test time, which makes the model faster and more convenient.
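The abstract names two ingredients: pooling the NMT encoder's hidden states into a single document vector with self-attention, and a distance constraint that pulls together the vectors produced from the two sides of a parallel document pair. A minimal sketch of these two pieces is below; it is not the paper's implementation, and the function names, the simple dot-product attention scoring, and the squared-L2 form of the constraint are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attention_pool(H, w):
    """Pool encoder hidden states H (T x d) into one document vector.

    Illustrative dot-product self-attention: a (hypothetical) learned
    query vector w (d,) scores each token, and the document vector is
    the attention-weighted sum of the hidden states.
    """
    scores = H @ w           # (T,) one score per token
    alpha = softmax(scores)  # attention weights, sum to 1
    return alpha @ H         # (d,) weighted sum of hidden states

def distance_penalty(v_src, v_tgt):
    """Squared L2 distance between the document vectors produced from
    the source and target sides of a parallel pair. Adding this term to
    the training loss encourages cross-lingually consistent vectors
    (the exact form of the constraint is an assumption here)."""
    return float(np.sum((v_src - v_tgt) ** 2))

# Toy usage: 5 encoder states of dimension 8.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
w = rng.standard_normal(8)
v = self_attention_pool(H, w)
```

Because the vector comes from a single encoder pass plus this pooling step, no decoding or translation is needed at test time, which is the speed advantage the abstract claims.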
URL
https://arxiv.org/abs/1807.11057