Paper Reading AI Learner

Causal-Inspired Multimodal Recommendation

2025-10-14 09:29:07
Jie Yang, Chenyang Gu, Zixuan Liu

Abstract

Multimodal recommender systems enhance personalized recommendations in e-commerce and online advertising by integrating visual, textual, and user-item interaction data. However, existing methods often overlook two critical biases: (i) modal confounding, where latent factors (e.g., brand style or product category) simultaneously drive multiple modalities and influence user preference, leading to spurious feature-preference associations; (ii) interaction bias, where genuine user preferences are mixed with noise from exposure effects and accidental clicks. To address these challenges, we propose a Causal-inspired multimodal Recommendation framework. Specifically, we introduce a dual-channel cross-modal diffusion module to identify hidden modal confounders, utilize back-door adjustment with hierarchical matching and vector-quantized codebooks to block confounding paths, and apply front-door adjustment combined with causal topology reconstruction to build a deconfounded causal subgraph. Extensive experiments on three real-world e-commerce datasets demonstrate that our method significantly outperforms state-of-the-art baselines while maintaining strong interpretability.
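The back-door adjustment the abstract invokes is the standard causal-inference identity P(Y | do(X=x)) = Σ_z P(Y | x, z) P(z), which blocks the confounding path through a latent factor Z (e.g. brand style) by averaging over it rather than conditioning on the observed mix. A minimal numeric sketch of that identity — the toy distribution below is invented for illustration and is not from the paper:

```python
# Toy back-door adjustment (standard causal-inference identity;
# the joint distribution here is hypothetical, not from the paper).
#
# Z = latent modal confounder (e.g. brand style), X = modal feature,
# Y = user preference. Identity: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z).

P_z = {0: 0.6, 1: 0.4}          # marginal P(Z)
P_y1_given_xz = {               # conditional P(Y=1 | X=x, Z=z)
    (0, 0): 0.2, (0, 1): 0.7,
    (1, 0): 0.3, (1, 1): 0.8,
}

def backdoor_p_y1_do_x(x):
    """P(Y=1 | do(X=x)) via back-door adjustment over the confounder Z."""
    return sum(P_y1_given_xz[(x, z)] * P_z[z] for z in P_z)

print(backdoor_p_y1_do_x(0))   # 0.2*0.6 + 0.7*0.4 = 0.40
print(backdoor_p_y1_do_x(1))   # 0.3*0.6 + 0.8*0.4 = 0.50
```

The averaging over P(z), rather than over P(z | x), is what removes the spurious feature-preference association the abstract describes; the front-door adjustment mentioned alongside it plays the analogous role when the confounder itself is unobserved but a mediator is available.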


URL

https://arxiv.org/abs/2510.12325

PDF

https://arxiv.org/pdf/2510.12325.pdf

