Paper Reading AI Learner

Leveraging Data Augmentation for Process Information Extraction

2024-04-11 06:32:03
Julian Neuberger, Leonie Doll, Benedict Engelmann, Lars Ackermann, Stefan Jablonski

Abstract

Business Process Modeling projects often require formal process models as a central component. The high cost of creating such formal process models has motivated many different fields of research aimed at the automated generation of process models from readily available data. These include process mining on event logs and generating business process models from natural language texts. Research in the latter field regularly faces the problem of limited data availability, which hinders both the evaluation and the development of new techniques, especially learning-based ones. To overcome this data scarcity issue, in this paper we investigate the application of data augmentation to natural language text data. Data augmentation methods are well established in machine learning for creating new, synthetic data without human assistance. We find that many of these methods are applicable to the task of business process information extraction and improve the accuracy of extraction. Our study shows that data augmentation is an important component in enabling machine learning methods for the task of business process model generation from natural language text, where rule-based systems are currently still mostly state of the art. Simple data augmentation techniques improved the $F_1$ score of mention extraction by 2.9 percentage points, and the $F_1$ score of relation extraction by 4.5 percentage points. To better understand how data augmentation alters human-annotated texts, we analyze the resulting text, visualizing and discussing the properties of augmented textual data. We make all code and experimental results publicly available.
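To make the idea of a "simple data augmentation technique" for token-level annotation concrete, the following is a minimal sketch, not the authors' method. It assumes BIO-style tags and randomly drops tokens that lie outside any mention, so that the remaining labels stay aligned with their tokens. The function names, the tag names, and the deletion probability `p` are hypothetical and chosen only for illustration. A small helper for the $F_1$ score is included to show how the metric reported above is computed from counts.

```python
import random


def delete_outside_tokens(tokens, tags, p=0.1, seed=None):
    """Randomly drop tokens tagged "O" (outside any mention).

    Illustrative assumption: tags follow a BIO scheme, so removing only
    "O" tokens never splits or relabels an annotated mention.
    """
    rng = random.Random(seed)
    kept = [
        (token, tag)
        for token, tag in zip(tokens, tags)
        # Mention tokens are always kept; "O" tokens survive with probability 1 - p.
        if tag != "O" or rng.random() >= p
    ]
    new_tokens = [token for token, _ in kept]
    new_tags = [tag for _, tag in kept]
    return new_tokens, new_tags


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example sentence with hypothetical tag names.
    tokens = ["The", "clerk", "then", "checks", "the", "invoice", "."]
    tags = ["B-Actor", "I-Actor", "O", "B-Activity", "B-Data", "I-Data", "O"]
    print(delete_outside_tokens(tokens, tags, p=0.5, seed=0))
    print(f1_score(true_positives=8, false_positives=2, false_negatives=2))
```

An improvement of 2.9 percentage points means, for example, a change in $F_1$ from 0.700 to 0.729; it is an absolute difference, not a relative one.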

URL

https://arxiv.org/abs/2404.07501

PDF

https://arxiv.org/pdf/2404.07501.pdf

