Abstract
The remarkable capabilities of Large Language Models (LLMs) can be attributed mainly to their massive training datasets, which are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging the likelihood gap between real and synthetic data, which we realize through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims about data usage and demonstrates our method's reliability for real-world litigation. Our code is available at this https URL.
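To make the suffix-based completion task concrete, here is a minimal Python sketch of how member documents could be turned into prefix-prompt / suffix-target pairs for fine-tuning the data generator. The split fraction, jitter, and function names are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch: turn documents into suffix-completion training
# pairs for the data generator. Split points and field names are
# illustrative assumptions, not the paper's exact construction.

import random

def make_completion_pairs(documents, prefix_frac=0.25, seed=0):
    """For each document, keep a short prefix as the prompt and use the
    remaining suffix as the completion target for generator fine-tuning."""
    rng = random.Random(seed)
    pairs = []
    for doc in documents:
        words = doc.split()
        if len(words) < 8:
            continue  # too short to split meaningfully
        # Jitter the split point so the generator sees varied prefix lengths.
        k = max(1, int(len(words) * prefix_frac * rng.uniform(0.5, 1.5)))
        k = min(k, len(words) - 1)
        prompt, target = " ".join(words[:k]), " ".join(words[k:])
        pairs.append({"prompt": prompt, "completion": target})
    return pairs

if __name__ == "__main__":
    docs = ["The quick brown fox jumps over the lazy dog near the river bank."]
    print(make_completion_pairs(docs))
```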
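The second step, bridging the likelihood gap via post-hoc calibration, and the final DI decision could look roughly as follows. This sketch assumes per-example log-likelihood scores are already computed, models calibration as a simple mean-shift estimated with an independent reference model (which trained on neither set, so it may fairly score the real suspect data), and uses a one-sided Welch t-test as the DI decision rule; the paper's actual calibration procedure and test statistic may differ.

```python
# Hypothetical sketch of post-hoc calibration and a DI-style test,
# assuming per-example log-likelihood scores are precomputed.

import numpy as np
from scipy import stats

def calibrate(synth_scores_suspect, synth_scores_ref, real_scores_ref):
    """Illustrative calibration: estimate the real-vs-synthetic likelihood
    gap under a reference model that trained on neither set (the real
    scores can come from the suspect data itself, scored by that reference
    model), then remove the gap from the synthetic scores measured under
    the suspect model."""
    gap = np.mean(synth_scores_ref) - np.mean(real_scores_ref)
    return np.asarray(synth_scores_suspect) - gap

def dataset_inference_test(suspect_scores, heldout_scores, alpha=0.05):
    """DI-style decision: flag training-set usage if the suspect set scores
    significantly higher (more likely under the suspect model) than the
    calibrated synthetic held-out set."""
    t, p = stats.ttest_ind(suspect_scores, heldout_scores,
                           alternative="greater", equal_var=False)
    return {"t_statistic": float(t), "p_value": float(p),
            "trained_on": bool(p < alpha)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    suspect = rng.normal(-2.0, 0.5, 500)  # member-like: higher log-likelihood
    heldout = rng.normal(-2.4, 0.5, 500)  # calibrated synthetic held-out set
    print(dataset_inference_test(suspect, heldout))
```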
Abstract (translated)
The remarkable capabilities of Large Language Models (LLMs) are mainly attributable to their massive training datasets, which are typically scraped from the internet, often without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy for this problem by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private held-out set that closely matches the distribution of the compromised dataset yet did not participate in training. Such in-distribution held-out data is rarely available in practice, which greatly limits DI's applicability. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a suffix-based completion task; and (2) bridging the likelihood gap between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as the held-out set enables DI to detect the original training sets with high confidence while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims about data usage and demonstrates our method's reliability for real-world litigation. Our code is available at this URL.
URL
https://arxiv.org/abs/2506.15271