TextMachina: Seamless Generation of Machine-Generated Text Datasets

2024-01-08 15:05:32
Areg Mikael Sarvazyan, José Ángel González, Marc Franco-Salvador

Abstract

Recent advancements in Large Language Models (LLMs) have led to high-quality Machine-Generated Text (MGT), giving rise to countless new use cases and applications. However, easy access to LLMs is posing new challenges due to misuse. To address malicious usage, researchers have released datasets to effectively train models on MGT-related tasks. Similar strategies are used to compile these datasets, but no tool currently unifies them. In this scenario, we introduce TextMachina, a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, or boundary detection. It provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets, such as LLM integrations, prompt templating, and bias mitigation. The quality of the datasets generated by TextMachina has been assessed in previous works, including shared tasks where more than one hundred teams trained robust MGT detectors.

URL

https://arxiv.org/abs/2401.03946

PDF

https://arxiv.org/pdf/2401.03946.pdf

