Abstract
As the integration of large language models into daily life increases, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability to offer advice on deeply personal concerns, built from the LifeProTips subreddit forum. The forum features a dynamic interaction in which users post advice-seeking questions, each receiving an average of 8.9 answers and 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. We therefore construct a benchmark comprising daily-life questions, diverse corresponding answers, and majority-vote rankings, which we use to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, and analyze phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap toward QA systems that provide personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.
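A helpfulness metric of this kind is typically trained as a scalar reward model over preference pairs derived from the rankings. The sketch below is a minimal illustration assuming a Bradley-Terry-style pairwise ranking loss over upvote-ranked answer pairs; the roberta-base scorer, the pairing scheme, and the example strings are our assumptions, not necessarily the paper's exact setup.

# Hypothetical sketch: train a scalar "helpfulness" scorer from
# upvote-ranked answer pairs with a Bradley-Terry pairwise loss.
# Model choice and pairing scheme are assumptions, not the paper's setup.
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1  # single scalar helpfulness score
)

def pairwise_ranking_loss(question, better_answer, worse_answer):
    """Score both answers; push the upvote-preferred one higher."""
    batch = tokenizer(
        [question + " " + better_answer, question + " " + worse_answer],
        return_tensors="pt", padding=True, truncation=True,
    )
    scores = model(**batch).logits.squeeze(-1)  # shape: (2,)
    # -log sigmoid(s_better - s_worse): Bradley-Terry ranking objective
    return -F.logsigmoid(scores[0] - scores[1])

loss = pairwise_ranking_loss(
    "How do I stay productive working from home?",
    "Set a fixed schedule and a dedicated workspace.",  # more upvotes
    "Just try harder.",                                 # fewer upvotes
)
loss.backward()  # one optimization step would follow in a training loop

In practice, pairs would be sampled from each question's ranked answer list (e.g., all ordered pairs or adjacent ranks), and the trained scorer can then rank new model-generated advice by predicted helpfulness.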
URL
https://arxiv.org/abs/2404.11826