Producing Corpora of Medieval and Premodern Occitan

2019-04-26 12:55:03
Jean-Baptiste Camps (CJM), Gilles Guilhem Couffignal (PLH)

Abstract

At a time when the quantity of - more or less freely - available data is increasing significantly, thanks to digital corpora, editions or libraries, the development of data mining tools or deep learning methods allows researchers to build a corpus of study tailored for their research, to enrich their data and to exploit them.

Open optical character recognition (OCR) tools can be adapted to old prints, incunabula or even manuscripts, with usable results, allowing the rapid creation of textual corpora. The alternation of training and correction phases makes it possible to improve the quality of the results by rapidly accumulating raw text data. These can then be structured, for example in XML/TEI, and enriched.

The enrichment of the texts with graphic or linguistic annotations can also be automated. These processes, known to linguists and functional for modern languages, present difficulties for languages such as Medieval Occitan, due in part to the absence of big enough lemmatized corpora. Suggestions for the creation of tools adapted to the considerable spelling variation of ancient languages will be presented, as well as experiments for the lemmatization of Medieval and Premodern Occitan.

These techniques open the way for many exploitations. The much desired increase in the amount of available quality texts and data makes it possible to improve digital philology methods, if everyone takes the trouble to make their data freely available online and reusable.

By exposing different technical solutions and some micro-analyses as examples, this paper aims to show part of what digital philology can offer to researchers in the Occitan domain, while recalling the ethical issues on which such practices are based.
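
As an illustration of the step in which raw OCR text is "structured, for example in XML/TEI", the following is a minimal sketch and not the authors' pipeline. It assumes the OCR output is plain text with one verse line per text line, and wraps it in a bare TEI skeleton using only the Python standard library. The function name `lines_to_tei`, the header wording, and the placeholder input are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def q(tag):
    """Return the namespace-qualified form of a TEI element name."""
    return f"{{{TEI_NS}}}{tag}"


def lines_to_tei(title, raw_text):
    """Wrap raw text (one verse line per text line) in a minimal TEI tree.

    Hypothetical helper: blank lines are skipped and each remaining line
    becomes a numbered <l> element inside a single <lg> line group.
    """
    root = ET.Element(q("TEI"))

    # Minimal header: title, publication and source statements.
    header = ET.SubElement(root, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub_stmt, q("p")).text = "Unpublished working transcription."
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source_desc, q("p")).text = "Raw OCR output, not yet corrected."

    # Body: one line group holding the numbered verse lines.
    text = ET.SubElement(root, q("text"))
    body = ET.SubElement(text, q("body"))
    line_group = ET.SubElement(body, q("lg"))
    number = 0
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        number += 1
        ET.SubElement(line_group, q("l"), {"n": str(number)}).text = line
    return root


# Placeholder input standing in for real OCR output.
sample = "first verse line (placeholder)\n\nsecond verse line (placeholder)\n"
tree = lines_to_tei("Placeholder title", sample)
print(ET.tostring(tree, encoding="unicode"))
```

Keeping the line numbers in an `n` attribute leaves the transcription itself untouched, so later correction passes or linguistic annotation can be layered on without rewriting the text.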

URL

https://arxiv.org/abs/1904.11815

PDF

https://arxiv.org/pdf/1904.11815.pdf

