Abstract
Recent advancements in Large Language Models (LLMs) have led to high-quality Machine-Generated Text (MGT), giving rise to countless new use cases and applications. However, easy access to LLMs is posing new challenges due to misuse. To address malicious usage, researchers have released datasets to effectively train models on MGT-related tasks. These datasets are compiled following similar strategies, yet no existing tool unifies them. In this context, we introduce TextMachina, a modular and extensible Python framework designed to aid in the creation of high-quality, unbiased datasets for building robust models for MGT-related tasks such as detection, attribution, or boundary detection. It provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets, such as LLM integrations, prompt templating, and bias mitigation. The quality of the datasets generated by TextMachina has been assessed in previous works, including shared tasks where more than one hundred teams trained robust MGT detectors.
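
As a rough illustration of what such a pipeline automates, the sketch below shows a minimal generate-and-label loop for an MGT detection dataset: a prompt is templated from the prefix of a human document, an LLM continues it, and the human and machine texts are paired and labeled. This is a hypothetical sketch, not TextMachina's actual API; generate_with_llm, PROMPT_TEMPLATE, and build_detection_dataset are illustrative placeholders for a real LLM client and the framework's own abstractions.

import random
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: str  # "human" or "generated"

# Illustrative prompt template; a real pipeline would support many task-specific templates.
PROMPT_TEMPLATE = "Continue the following article:\n\n{prefix}"

def generate_with_llm(prompt: str) -> str:
    """Placeholder for an LLM provider call (e.g., a chat-completions request)."""
    raise NotImplementedError("plug in an LLM client here")

def build_detection_dataset(human_docs: list[str], prefix_len: int = 50) -> list[Example]:
    """Pair each human document with a machine continuation produced from its prefix."""
    dataset: list[Example] = []
    for doc in human_docs:
        prefix = " ".join(doc.split()[:prefix_len])
        prompt = PROMPT_TEMPLATE.format(prefix=prefix)
        generated = generate_with_llm(prompt)
        dataset.append(Example(text=doc, label="human"))
        dataset.append(Example(text=prefix + " " + generated, label="generated"))
    random.shuffle(dataset)  # avoid ordering artifacts between the two classes
    return dataset

Continuing from the prefix of a human document keeps the topic distribution of the two classes aligned, which is one simple way to mitigate the topical biases that MGT detectors can otherwise exploit; a full framework would add further controls such as length balancing and decoding-parameter variation.
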
URL
https://arxiv.org/abs/2401.03946