Paper Reading AI Learner

CAPID: Context-Aware PII Detection for Question-Answering Systems

2026-02-10 18:41:31
Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad, D. B. Emerson, Shubhankar Mohapatra, Xi He

Abstract

Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of it may be contextually relevant to the user's question, which degrades response quality. Large language models (LLMs) might be able to help determine which PII is relevant, but their closed-source nature and lack of privacy guarantees make them unsuitable for processing sensitive data. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) to filter sensitive information before it is passed to an LLM for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance, and type accuracy while preserving significantly higher downstream utility under anonymization.
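The filtering step the abstract describes can be illustrated with a minimal sketch. Assuming the fine-tuned SLM has already emitted PII spans annotated with a type and a contextual-relevance flag (the `PIISpan` class and `redact_irrelevant` function below are hypothetical, not from the paper), only the irrelevant spans are replaced with type placeholders before the query reaches the LLM:

```python
from dataclasses import dataclass


@dataclass
class PIISpan:
    """Hypothetical SLM output: a character span with a PII type and relevance flag."""
    start: int
    end: int
    pii_type: str
    relevant: bool  # contextual relevance as predicted by the SLM


def redact_irrelevant(query: str, spans: list[PIISpan]) -> str:
    """Redact PII judged contextually irrelevant; keep relevant PII so the
    downstream LLM can still answer the question."""
    # Apply replacements right-to-left so earlier character offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if not span.relevant:
            query = query[:span.start] + f"[{span.pii_type}]" + query[span.end:]
    return query


query = "I'm John Doe and I live in Toronto; what ER wait times are typical near Toronto?"
spans = [
    PIISpan(4, 12, "NAME", relevant=False),  # the name does not affect the answer
    PIISpan(27, 34, "CITY", relevant=True),  # the city is needed to answer
]
print(redact_irrelevant(query, spans))
# → "I'm [NAME] and I live in Toronto; what ER wait times are typical near Toronto?"
```

Keeping the relevant `CITY` span while masking the `NAME` span is the core trade-off CAPID targets: full redaction would also hide "Toronto" and hurt answer quality.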

URL

https://arxiv.org/abs/2602.10074

PDF

https://arxiv.org/pdf/2602.10074.pdf
