Abstract
In Brazil, CAPES, the governmental body responsible for overseeing and coordinating post-graduate programs, keeps records of all theses and dissertations presented in the country. Information on these documents can be accessed online in the Theses and Dissertations Catalog (TDC), which contains abstracts in Portuguese and English along with additional metadata. This database is therefore a potential source of parallel corpora for Portuguese and English. In this article, we present the development of a parallel corpus from TDC, which CAPES makes available under the open data initiative. Approximately 240,000 documents were collected and aligned using the Hunalign tool. We demonstrate the usefulness of the corpus by training Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for both language directions and comparing them with Google Translate (GT). Both translation models achieved better BLEU scores than GT, with the NMT system being the most accurate. Sentence alignment was also manually evaluated, with an average of 82.30% of sentences correctly aligned. Our parallel corpus is freely available in TMX format, with complementary information regarding document metadata.
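Since the corpus is distributed in TMX, a reader may want to turn it into plain sentence pairs before training or evaluation. The following is a minimal sketch using only the Python standard library. It assumes the usual TMX layout (`tu` units containing `tuv` variants tagged with `xml:lang`, each holding a `seg`), and assumes the language codes `pt` and `en`; the function name `read_tmx_pairs` and the embedded sample are hypothetical and are not taken from the released files.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang="pt", tgt_lang="en"):
    """Return (source, target) sentence pairs from a TMX document string.

    The language codes are assumptions; check the codes actually used
    in the corpus before relying on the defaults.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "pt-BR" is treated as "pt".
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        # Skip units that lack either side of the pair.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Invented sample for illustration only; not an excerpt from the corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="pt" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="pt"><seg>Este trabalho apresenta um corpus paralelo.</seg></tuv>
      <tuv xml:lang="en"><seg>This work presents a parallel corpus.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX)
for source, target in pairs:
    print(source, "|||", target)
```

For a file on disk, the same function can be called on the file's contents; for a corpus of this size, an incremental parser such as `ET.iterparse` would likely be a better fit than loading the whole document into memory.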
URL
https://arxiv.org/abs/1905.01715