Abstract
Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, it has focused primarily on text-based methods, often overlooking the non-verbal evidence that is crucial in real-life therapy. To address this gap, we extend text-based cognitive reframing to a multimodal setting by incorporating visual cues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image reflecting the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions guide the interpretation of implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that VLMs' performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.
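The following is a minimal, illustrative sketch of what a multi-hop psychotherapeutic reasoning loop over a vision-language model might look like. It is not the paper's implementation: the model name ("gpt-4o"), the OpenAI Python client, the hop prompts, and the helper names (encode_image, ask, multi_hop_session) are all assumptions made for illustration. The sketch only shows the general idea of staging the dialogue into hops that first extract non-verbal evidence from the client's image, then infer the underlying emotion and thought, and finally produce a reframing suggestion.

```python
# Hypothetical sketch of multi-hop reasoning with a generic VLM, assuming the
# OpenAI Python client and a vision-capable model; the paper's actual models,
# prompts, and hop structure may differ.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read the virtual client's facial-expression image and base64-encode it."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def ask(messages: list) -> str:
    """One reasoning hop: send the running conversation to the VLM and record its reply."""
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


def multi_hop_session(image_path: str, client_utterance: str) -> str:
    image_b64 = encode_image(image_path)
    messages = [
        {"role": "system",
         "content": "You are a supportive therapist practicing cognitive reframing."},
        {"role": "user", "content": [
            {"type": "text", "text": f"Client says: {client_utterance}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]},
    ]
    # Hop 1: identify non-verbal (facial-expression) evidence visible in the image.
    messages.append({"role": "user", "content":
        "Step 1: Describe the client's facial expression and any non-verbal emotional evidence."})
    ask(messages)
    # Hop 2: combine visual and textual evidence to infer the emotion and implicit negative thought.
    messages.append({"role": "user", "content":
        "Step 2: Using that evidence and the client's words, infer the emotion and the underlying negative thought."})
    ask(messages)
    # Hop 3: produce an empathetic reframing suggestion grounded in the inferred evidence.
    messages.append({"role": "user", "content":
        "Step 3: Offer an empathetic cognitive-reframing suggestion grounded in the evidence above."})
    return ask(messages)


if __name__ == "__main__":
    print(multi_hop_session("client_face.jpg", "I failed my exam; I feel worthless."))
```

A standard-prompting baseline would instead ask for the reframing suggestion in a single turn; the staged hops make the intermediate evidence explicit so the final suggestion can reference it.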
URL
https://arxiv.org/abs/2502.06873