Abstract
Qualitative research often contains personal, contextual, and organizational details that pose privacy risks if not handled appropriately. Manual anonymization is time-consuming, inconsistent, and frequently omits critical identifiers. Existing automated tools tend to rely on pattern matching or fixed rules, which fail to capture context and may alter the meaning of the data. This study uses local LLMs to build a reliable, repeatable, and context-aware anonymization process for detecting and anonymizing sensitive data in qualitative transcripts. We introduce a Structured Framework for Adaptive Anonymizer (SFAA) that comprises three steps: detection, classification, and adaptive anonymization. The SFAA incorporates four anonymization strategies: rule-based substitution, context-aware rewriting, generalization, and suppression. These strategies are applied according to the identifier type and the risk level. The identifiers handled by the SFAA are guided by major international privacy and research ethics standards, including the GDPR, HIPAA, and the OECD guidelines. This study followed a dual-method evaluation that combined manual and LLM-assisted processing, supported by two case studies. The first comprises 82 face-to-face interviews on gamification in organizations. The second involves 93 machine-led interviews conducted by an AI-powered interviewer to examine LLM awareness and workplace privacy. Two local models, LLaMA and Phi, were used to evaluate the performance of the proposed framework. The results indicate that the LLMs identified more sensitive data than a human reviewer. Phi outperformed LLaMA in detecting sensitive data, but made slightly more errors. Phi detected over 91% of the sensitive data, and 94.8% of its output preserved the sentiment of the original text, indicating high accuracy without distorting subsequent qualitative analysis.
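The three-step pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the naive regex detector, the `RISK` table, and the strategy-selection policy are all assumptions standing in for the LLM-based detection and classification the paper actually uses.

```python
import re

# Assumed risk levels per identifier type (illustrative, not from the paper).
RISK = {"PERSON": "high", "ORG": "medium", "DATE": "low"}

def detect(text):
    """Step 1: detection. Stand-in for LLM-based detection; here a naive
    capitalized-name regex returns (span, identifier_type) pairs."""
    return [(m.group(), "PERSON")
            for m in re.finditer(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)]

def choose_strategy(id_type, risk):
    """Step 2: classification maps each identifier to one of the four
    strategies by type and risk level (assumed policy)."""
    if risk == "high":
        return "suppression"
    if id_type == "ORG":
        return "generalization"
    return "rule_based_substitution"

def anonymize(text):
    """Step 3: adaptive anonymization applies the chosen strategy."""
    for span, id_type in detect(text):
        strategy = choose_strategy(id_type, RISK.get(id_type, "medium"))
        if strategy == "suppression":
            text = text.replace(span, "[REDACTED]")
        elif strategy == "generalization":
            text = text.replace(span, "an organization")
        else:  # rule-based substitution with a type placeholder
            text = text.replace(span, f"[{id_type}]")
    return text

print(anonymize("John Smith joined the gamification pilot."))
# Context-aware rewriting, the fourth strategy, would require an LLM call
# and is omitted from this sketch.
```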
URL
https://arxiv.org/abs/2601.14683