A Parallel Corpus of Theses and Dissertations Abstracts

2019-05-05 16:53:03
Felipe Soares, Gabrielli Harumi Yamashita, Michel Jose Anzanello

Abstract

In Brazil, the governmental body responsible for overseeing and coordinating post-graduate programs, CAPES, keeps records of all theses and dissertations presented in the country. Information regarding such documents can be accessed online in the Theses and Dissertations Catalog (TDC), which contains abstracts in Portuguese and English, and additional metadata. Thus, this database can be a potential source of parallel corpora for the Portuguese and English languages. In this article, we present the development of a parallel corpus from TDC, which is made available by CAPES under the open data initiative. Approximately 240,000 documents were collected and aligned using the Hunalign tool. We demonstrate the capability of our developed corpus by training Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for both language directions, followed by a comparison with Google Translate (GT). Both translation models achieved better BLEU scores than GT, with the NMT system being the most accurate. Sentence alignment was also manually evaluated, with an average of 82.30% of sentences correctly aligned. Our parallel corpus is freely available in TMX format, with complementary information regarding document metadata.
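
Since the corpus is distributed in TMX format, a reader may want to pull Portuguese-English sentence pairs out of such a file. The following is a minimal sketch using only the Python standard library. The function name `read_tmx_pairs`, the inline `SAMPLE` document, and the language codes `pt` and `en` are assumptions made for illustration; the abstract does not describe the exact layout or language tags of the released files, so they may need adjusting.

```python
import xml.etree.ElementTree as ET

# TMX 1.4 marks the language of each <tuv> with xml:lang, which ElementTree
# exposes under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src="pt", tgt="en"):
    """Return (source, target) sentence pairs from a TMX document string.

    Hypothetical helper: the language codes are assumptions, not taken
    from the released corpus.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per aligned pair
        segs = {}
        for tuv in tu.findall("tuv"):
            # Fall back to the older plain "lang" attribute if xml:lang is absent.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "pt-BR" is treated as "pt".
                segs[lang[:2]] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


# Made-up two-sentence sample, not taken from the actual corpus.
SAMPLE = """<tmx version="1.4">
  <header srclang="pt" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="pt"><seg>Este é um resumo.</seg></tuv>
      <tuv xml:lang="en"><seg>This is an abstract.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE)
```

For a corpus on the order of the roughly 240,000 documents mentioned above, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be preferable to loading the whole file into memory.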

URL

https://arxiv.org/abs/1905.01715

PDF

https://arxiv.org/pdf/1905.01715.pdf

