Paper Reading AI Learner

CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences

2025-12-11 18:44:44
Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball, Alex Cabral, Josiah Hester

Abstract

Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.

Abstract (translated)

社会存在感是共同观看内容时享受体验的核心,然而现代媒体消费越来越倾向于个人化。我们研究了多代理对话人工智能系统是否能够重现不同类型内容的共享观看经历动态效果。我们介绍了CompanionCast,这是一个通过使用多媒体输入、语音合成和空间音频来响应视频内容的多个角色专业化的AI代理的通用框架。 独特的是,CompanionCast整合了一个LLM(大语言模型)作为裁判模块,该模块会迭代地评估并改进对话在五个维度上的表现(相关性、真实性、参与度、多样性以及个性一致性)。我们通过体育观看验证了这个框架的有效性——这是一个拥有丰富动态特性和强大社会传统的领域。一项针对足球迷的小规模研究表明,在与多代理系统互动时,人们感知到的社会存在感比独自观看要高。 我们的贡献包括: 1. 一种围绕多媒体视频内容组织多代理对话的通用化框架; 2. 对话质量控制的新颖评估者-代理管道; 3. 探索性证据表明AI中介的共观体验能够增强社会存在感。 我们讨论了将这种方法应用于娱乐、教育和协作观看等多样化场景中的挑战以及未来的发展方向。

URL

https://arxiv.org/abs/2512.10918

PDF

https://arxiv.org/pdf/2512.10918.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot