Paper Reading AI Learner

InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy

2025-06-30 18:00:41
Vishnu Vinod, Krishna Pillutla, Abhradeep Guha Thakurta

Abstract

As major progress in LLM-based long-form text generation enables paradigms such as retrieval-augmented generation (RAG) and inference-time scaling, safely incorporating private information into the generation remains a critical open question. We present InvisibleInk, a highly scalable long-form text generation framework satisfying rigorous differential privacy guarantees with respect to the sensitive references. It interprets sampling from the LLM's next-token distribution as the exponential mechanism over the LLM logits, with two innovations. First, we reduce the privacy cost by isolating and clipping only the sensitive information in the model logits (relative to the public logits). Second, we improve text quality by sampling from a small superset of the top-$k$ private tokens. Empirical evaluations demonstrate a consistent $8\times$ reduction in computation cost over state-of-the-art baselines to generate long-form private text of the same utility across privacy levels. In summary, InvisibleInk is able to generate private long-form text at less than $10\times$ the computation cost of non-private generation.
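The two ideas in the abstract, clipping only the sensitive component of the logits (the difference from public logits) and sampling from a restricted top-$k$ candidate set via softmax (an instance of the exponential mechanism), can be sketched as follows. This is a minimal illustration assuming NumPy arrays of per-token logits; the function name, parameters, and clipping scheme are illustrative assumptions, not the paper's actual algorithm or API, and no formal privacy accounting is performed here.

```python
import numpy as np

def clipped_topk_sample(private_logits, public_logits, clip_c=1.0, k=10, rng=None):
    """Illustrative sketch (not the paper's algorithm):
    (1) isolate the sensitive information as the difference between
        private and public logits, and clip it to bound sensitivity;
    (2) sample from the top-k tokens of the clipped logits with a
        softmax, i.e. an exponential mechanism whose utility is the logit."""
    rng = rng or np.random.default_rng()
    # Sensitive component: how much the private references shift the logits.
    delta = private_logits - public_logits
    # Clip only this sensitive component; public information is untouched.
    delta = np.clip(delta, -clip_c, clip_c)
    clipped_logits = public_logits + delta
    # Restrict sampling to the top-k tokens under the clipped logits.
    top = np.argpartition(clipped_logits, -k)[-k:]
    # Softmax over the candidate set (numerically stabilized).
    z = clipped_logits[top] - clipped_logits[top].max()
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(top, p=p))
```

Clipping the difference (rather than the raw private logits) is what keeps the privacy cost proportional only to the sensitive information, since tokens whose private and public logits agree contribute nothing after clipping.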

Abstract (translated)

As LLM-based long-form text generation enables paradigms such as retrieval-augmented generation (RAG) and inference-time scaling, safely incorporating private information during generation remains a critical open problem. We present InvisibleInk, a highly scalable long-form text generation framework that provides rigorous differential privacy guarantees with respect to sensitive reference data. InvisibleInk interprets sampling from the LLM's next-token distribution as an exponential mechanism over the model logits, with two innovations. First, it reduces the privacy cost by isolating and clipping only the sensitive information in the logits, relative to the public logits. Second, it improves text quality by sampling from a small superset of the top-$k$ private tokens. Empirical evaluations show a consistent $8\times$ reduction in computation cost over state-of-the-art baselines when generating long-form private text of the same utility across privacy levels. In summary, InvisibleInk can generate private long-form text at less than $10\times$ the computation cost of non-private generation.

URL

https://arxiv.org/abs/2507.02974

PDF

https://arxiv.org/pdf/2507.02974.pdf
