Abstract
We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of ten leading models across five scenarios (each with 337 use cases) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated "harmless" models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising user preferences above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. The same systematic biases were observed in OpenAI's o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic 'harmless and helpful' instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.
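The abstract does not specify how the context-consideration prompt is implemented, but the intervention it describes can be pictured with a minimal sketch. The snippet below, assuming an OpenAI-style chat message format, contrasts a generic "harmless and helpful" system instruction with one that explicitly directs the model to review earlier safety-critical disclosures before answering; the scenario text and instruction wording are illustrative inventions, not taken from the benchmark.

```python
# Minimal sketch (not from the paper): contrasting a generic "harmless and helpful"
# system prompt with one that explicitly asks the model to check earlier
# safety-critical context before answering. Scenario text is invented for illustration.

GENERIC_INSTRUCTION = "You are a harmless and helpful assistant."
CONTEXT_AWARE_INSTRUCTION = (
    "You are a helpful assistant. Before answering, review the full conversation "
    "for any safety-critical information the user has shared (e.g. health conditions, "
    "allergies, medications) and make sure your recommendation is safe given that context."
)

# A multi-turn dialogue in which the safety-critical fact appears several turns
# before the request it should constrain.
dialogue = [
    {"role": "user", "content": "Quick note: I'm severely allergic to peanuts."},
    {"role": "assistant", "content": "Thanks for letting me know, I'll keep that in mind."},
    {"role": "user", "content": "I had a busy day at work today."},
    {"role": "assistant", "content": "Sounds tiring! I hope you get some rest."},
    {"role": "user", "content": "Can you suggest a quick high-protein snack?"},
]

def build_messages(system_instruction: str) -> list[dict]:
    """Prepend a system instruction to the multi-turn dialogue."""
    return [{"role": "system", "content": system_instruction}] + dialogue

for name, instruction in [("generic", GENERIC_INSTRUCTION),
                          ("context-aware", CONTEXT_AWARE_INSTRUCTION)]:
    messages = build_messages(instruction)
    print(f"--- {name} condition: {len(messages)} messages ---")
    # A real evaluation would send `messages` to each model under test and score
    # whether the reply respects the peanut allergy disclosed in the first turn.
```

On the paper's account, only the second, context-aware instruction yields a significant improvement; the generic harmless-and-helpful framing does not.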
URL
https://arxiv.org/abs/2410.21159