Paper Reading AI Learner

Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning

2025-02-08 07:32:48
Subin Kim, Hoonrae Kim, Heejin Do, Gary Geunbae Lee

Abstract

Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, it focused primarily on text-based methods, often overlooking the non-verbal evidence that is crucial in real-life therapy. To address this gap, we extend textual cognitive reframing to a multimodal setting by incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions guide the interpretation of implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs' performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.
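To make the multi-hop idea concrete, here is a minimal sketch of what staged psychotherapeutic prompting could look like. The hop names (emotion identification, distortion detection, reframing), the prompt wording, and the use of an image caption as a stand-in for the facial-expression image are all illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of multi-hop psychotherapeutic prompting.
# Hop names and prompt wording are assumptions for illustration;
# the paper's exact reasoning stages and prompts may differ.

def build_hop_prompt(hop, dialogue, image_caption, prior_findings):
    """Compose the prompt for one reasoning hop.

    dialogue: the client's utterance(s) so far.
    image_caption: textual stand-in for the client's facial-expression image.
    prior_findings: outputs of earlier hops, carried forward explicitly
        so each hop reasons over the evidence already gathered.
    """
    instructions = {
        "emotion": "Identify the client's emotion, using both the words "
                   "and the facial expression as evidence.",
        "distortion": "Name the cognitive distortion suggested by the "
                      "utterance and the identified emotion.",
        "reframe": "Offer an empathetic cognitive reframing grounded in "
                   "the evidence gathered so far.",
    }
    context = "\n".join(f"[{k}] {v}" for k, v in prior_findings.items())
    return (
        f"Client says: {dialogue}\n"
        f"Facial expression: {image_caption}\n"
        f"Findings so far:\n{context or '(none)'}\n"
        f"Task: {instructions[hop]}"
    )

def run_multi_hop(dialogue, image_caption, model):
    """Run the hops in order, feeding each hop's answer into the next."""
    findings = {}
    for hop in ("emotion", "distortion", "reframe"):
        prompt = build_hop_prompt(hop, dialogue, image_caption, findings)
        findings[hop] = model(prompt)  # model: any text-in/text-out callable
    return findings
```

The key design point this sketch illustrates is that each hop's output is fed back into the next hop's prompt, so the final reframing is explicitly conditioned on the intermediate evidence rather than produced in a single pass.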

URL

https://arxiv.org/abs/2502.06873

PDF

https://arxiv.org/pdf/2502.06873.pdf
