Leveraging Data Augmentation for Process Information Extraction

Abstract

Business process modeling projects often require formal process models as a central component. The high cost of creating such formal process models has motivated many different fields of research aimed at the automated generation of process models from readily available data. These include process mining on event logs and generating business process models from natural language texts. Research in the latter field regularly faces the problem of limited data availability, which hinders both the evaluation and the development of new techniques, especially learning-based ones. To overcome this data scarcity issue, in this paper we investigate the application of data augmentation to natural language text data. Data augmentation methods are well established in machine learning for creating new, synthetic data without human assistance. We find that many of these methods are applicable to the task of business process information extraction and improve the accuracy of extraction. Our study shows that data augmentation is an important component in enabling machine learning methods for the task of business process model generation from natural language text, where rule-based systems are currently still mostly state of the art. Simple data augmentation techniques improved the $F_1$ score of mention extraction by 2.9 percentage points and the $F_1$ score of relation extraction by 4.5 percentage points. To better understand how data augmentation alters human-annotated texts, we analyze the resulting text, visualizing and discussing the properties of augmented textual data. We make all code and experiment results publicly available.

URL

https://arxiv.org/abs/2404.07501

PDF

https://arxiv.org/pdf/2404.07501.pdf
