Abstract
In this paper we compare the structure of Czech word embeddings for English-Czech neural machine translation (NMT), word2vec, and sentiment analysis. We show that although it is possible to successfully predict part-of-speech (POS) tags from the word embeddings of word2vec and of various translation models, not all of the embedding spaces exhibit the same structure. The information about POS is present in word2vec embeddings, but the high degree of organization by POS in the NMT decoder suggests that this information is more important for machine translation, and the NMT model therefore represents it in a more direct way. Our method is based on correlating principal component analysis (PCA) dimensions with categorical linguistic data. We also show that examining histograms of classes along the principal components is important for understanding how information is structured in the embeddings.
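As a rough illustration of the analysis described above, the sketch below fits PCA to a set of word embeddings and measures how strongly each principal component separates a categorical label such as POS, then plots class histograms along a component. The embeddings, POS labels, and the particular correlation measure (correlation of a PC coordinate with a per-class indicator) are placeholders and assumptions for illustration, not the authors' code or exact formulation.

```python
# Minimal sketch, assuming placeholder embeddings and POS class ids.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 300))   # placeholder word vectors
pos_tags = rng.integers(0, 10, size=1000)   # placeholder POS class ids

pca = PCA(n_components=20)
components = pca.fit_transform(embeddings)  # words projected onto PCs

# For each PC, find the POS class whose binary indicator correlates most
# strongly with the coordinate along that component.
for dim in range(components.shape[1]):
    best = max(
        abs(np.corrcoef(components[:, dim],
                        (pos_tags == c).astype(float))[0, 1])
        for c in np.unique(pos_tags)
    )
    print(f"PC{dim}: max |corr| with a POS indicator = {best:.3f}")

# Histograms of POS classes along the first principal component show
# whether the classes occupy separate regions of the embedding space.
for c in np.unique(pos_tags):
    plt.hist(components[pos_tags == c, 0], bins=30, alpha=0.4, label=str(c))
plt.xlabel("PC 1")
plt.legend()
plt.show()
```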
URL
https://arxiv.org/abs/1906.00114