Artificial Text Boundary Detection with Topological Data Analysis and Sliding Window Techniques

2023-11-14 17:48:19
Laida Kushnareva, Tatiana Gaintseva, German Magai, Serguei Barannikov, Dmitry Abulkhanov, Kristian Kuznetsov, Irina Piontkovskaya, Sergey Nikolenko

Abstract

Due to the rapid development of text generation models, people increasingly encounter texts that start out as written by a human but then continue as machine-generated output of large language models. Detecting the boundary between the human-written and machine-generated parts of such texts is a very challenging problem that has not received much attention in the literature. In this work, we consider a number of different approaches to this artificial text boundary detection problem, comparing several predictors over features of different nature. We show that supervised fine-tuning of the RoBERTa model works well for this task in general but fails to generalize in important cross-domain and cross-generator settings, demonstrating a tendency to overfit to spurious properties of the data. Then, we propose novel approaches based on features extracted from a frozen language model's embeddings that are able to outperform both the human accuracy level and previously considered baselines on the Real or Fake Text benchmark. Moreover, we adapt perplexity-based approaches for the boundary detection task and analyze their behaviour. We analyze the robustness of all proposed classifiers in cross-domain and cross-model settings, discovering important properties of the data that can negatively influence the performance of artificial text boundary detection algorithms.
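
To make the task concrete, here is a minimal Python sketch of boundary detection with a sliding window over per-sentence scores. It is not the authors' method. It assumes that a score per sentence has already been computed (for example a mean log-perplexity from some language model) and that generated text tends to score lower than human text. The function name `detect_boundary` and the `window` parameter are hypothetical.

```python
from statistics import mean


def detect_boundary(scores, window=3):
    """Return the index of the first sentence judged machine-generated.

    scores: one number per sentence. ASSUMPTION: these are precomputed,
    e.g. mean log-perplexity under some language model, and generated
    sentences tend to have lower values than human-written ones.
    Returns len(scores) when no drop in score is found.
    """
    n = len(scores)
    best_k, best_gap = n, 0.0
    # Try every split point k: sentences [0, k) are treated as human,
    # sentences [k, n) as generated.
    for k in range(1, n):
        left = scores[max(0, k - window):k]   # up to `window` sentences before k
        right = scores[k:k + window]          # up to `window` sentences from k on
        gap = mean(left) - mean(right)        # size of the drop across the split
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


if __name__ == "__main__":
    toy_scores = [5.1, 4.9, 5.3, 5.0, 2.1, 1.9, 2.2, 2.0]
    print(detect_boundary(toy_scores))
```

On the toy scores above, the largest drop lies between the fourth and fifth sentences, so the function returns index 4. Comparing local windows rather than whole-prefix and whole-suffix averages keeps the estimate sensitive to where the change happens; the window size is a free choice in this sketch.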


URL

https://arxiv.org/abs/2311.08349

PDF

https://arxiv.org/pdf/2311.08349.pdf

