CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants

2024-10-28 15:59:31
Lize Alberts, Benjamin Ellis, Andrei Lupu, Jakob Foerster

Abstract

We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of ten leading models across five scenarios (each with 337 use cases) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated "harmless" models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising user preferences above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. The same systematic biases were observed in OpenAI's o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic 'harmless and helpful' instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.

URL

https://arxiv.org/abs/2410.21159

PDF

https://arxiv.org/pdf/2410.21159.pdf

