Paper Reading AI Learner

Unlocking Post-hoc Dataset Inference with Synthetic Data

2025-06-18 08:46:59
Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic

Abstract

The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigation. Our code is available at this https URL.
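To make the idea concrete, the sketch below illustrates the general shape of such a dataset-inference test, not the authors' actual method: per-example scores (e.g., negative log-likelihoods under the suspect model) are compared between the suspect set and a synthetic held-out set, after a simple mean-offset calibration stands in for the paper's post-hoc calibration step. All data, the calibration scheme, and the permutation test are illustrative assumptions.

```python
import random
from statistics import mean

def permutation_test(suspect, held_out, n_perm=2000, seed=0):
    """One-sided permutation test: are suspect-set scores significantly
    lower (more 'memorized') than held-out scores?"""
    rng = random.Random(seed)
    observed = mean(held_out) - mean(suspect)
    pooled = suspect + held_out
    k = len(suspect)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Difference under the null hypothesis of no membership signal.
        if mean(pooled[k:]) - mean(pooled[:k]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Toy per-example NLL scores (synthetic Gaussians, purely illustrative).
rng = random.Random(1)
suspect_scores = [rng.gauss(2.0, 0.5) for _ in range(200)]    # seen in training: lower NLL
synthetic_scores = [rng.gauss(2.6, 0.5) for _ in range(200)]  # generator-induced likelihood gap

# Toy post-hoc calibration: estimate the real-vs-synthetic likelihood
# offset on a small split assumed to be non-member, then remove it.
calib_real = [rng.gauss(2.4, 0.5) for _ in range(50)]
calib_synth = [rng.gauss(2.6, 0.5) for _ in range(50)]
offset = mean(calib_synth) - mean(calib_real)
calibrated_held_out = [s - offset for s in synthetic_scores]

p = permutation_test(suspect_scores, calibrated_held_out)
print(f"p-value: {p:.4f}")
```

With calibration removing the generator's likelihood offset, the remaining gap between suspect and held-out scores reflects membership rather than distribution shift, so the test rejects the null with a small p-value on this toy data.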

URL

https://arxiv.org/abs/2506.15271

PDF

https://arxiv.org/pdf/2506.15271.pdf

