Local Language Models for Context-Aware Adaptive Anonymization of Sensitive Text

2026-01-21 05:59:56
Aisvarya Adeseye, Jouni Isoaho, Seppo Virtanen, Mohammad Tahir

Abstract

Qualitative research data often contain personal, contextual, and organizational details that pose privacy risks if not handled appropriately. Manual anonymization is time-consuming, inconsistent, and frequently misses critical identifiers. Existing automated tools tend to rely on pattern matching or fixed rules, which fail to capture context and may alter the meaning of the data. This study uses local LLMs to build a reliable, repeatable, and context-aware process for detecting and anonymizing sensitive data in qualitative transcripts. We introduce a Structured Framework for Adaptive Anonymizer (SFAA) that includes three steps: detection, classification, and adaptive anonymization. The SFAA incorporates four anonymization strategies: rule-based substitution, context-aware rewriting, generalization, and suppression. These strategies are applied based on the identifier type and the risk level. The identifiers handled by the SFAA are guided by major international privacy and research ethics standards, including the GDPR, HIPAA, and OECD guidelines. The study followed a dual-method evaluation that combined manual and LLM-assisted processing, supported by two case studies. The first includes 82 face-to-face interviews on gamification in organizations. The second involves 93 machine-led interviews using an AI-powered interviewer to test LLM awareness and workplace privacy. Two local models, LLaMA and Phi, were used to evaluate the performance of the proposed framework. The results indicate that the LLMs found more sensitive data than a human reviewer. Phi outperformed LLaMA in finding sensitive data but made slightly more errors. Phi found over 91% of the sensitive data, and 94.8% of its anonymized output kept the same sentiment as the original text, which suggests that the anonymization is accurate enough not to affect the analysis of the qualitative data.
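
The three steps named in the abstract (detection, classification, adaptive anonymization) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: regular expressions stand in for the local LLM detector, the identifier types, the RISK table, and the mapping from risk level to strategy are all hypothetical, and context-aware rewriting is left out because it needs a language model.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: regexes stand in for the local LLM detector described in the paper.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
    "age": re.compile(r"\b\d{1,3} years old\b"),
}

# ASSUMPTION: hypothetical risk levels per identifier type.
RISK = {"email": "high", "phone": "medium", "age": "low"}


@dataclass
class Finding:
    kind: str
    start: int
    end: int
    text: str


def detect(text):
    """Step 1: locate candidate identifiers, ordered by position."""
    findings = [
        Finding(kind, m.start(), m.end(), m.group(0))
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)


def classify(finding):
    """Step 2: assign a risk level from the identifier type."""
    return RISK.get(finding.kind, "medium")


def replacement_for(finding, risk):
    """Step 3: choose a strategy from identifier type and risk level."""
    if risk == "high":
        return "[REDACTED]"  # suppression
    if finding.kind == "age":
        age = int(re.match(r"\d+", finding.text).group(0))
        low = age // 10 * 10
        return f"{low}-{low + 9} years old"  # generalization to a range
    return f"[{finding.kind.upper()}]"  # rule-based substitution


def anonymize(text):
    out, cursor = [], 0
    for f in detect(text):
        if f.start < cursor:  # skip findings that overlap an earlier one
            continue
        out.append(text[cursor:f.start])
        out.append(replacement_for(f, classify(f)))
        cursor = f.end
    out.append(text[cursor:])
    return "".join(out)


print(anonymize("I am 34 years old, reach me at jane@example.org."))
```

In this sketch the strategy is chosen per finding, so a single sentence can mix suppression and generalization; a real pipeline would replace detect and classify with model calls.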

URL

https://arxiv.org/abs/2601.14683

PDF

https://arxiv.org/pdf/2601.14683.pdf

