Paper Reading AI Learner

MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration

2025-10-09 17:42:51
Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

Abstract

Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first Mixture-of-Agents Video Restoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the Restored Video Quality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.
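
To make the three-agent workflow described in the abstract concrete, below is a minimal conceptual sketch in Python of the identification → routing/restoration → quality-assessment loop. All class, tool, and function names are illustrative assumptions, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the MoA-VR three-agent pipeline; names are illustrative only.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RestorationStep:
    tool: str             # e.g. "denoise", "deblock" (hypothetical tool names)
    quality_score: float  # score from the quality-assessment agent after this step


class DegradationIdentifier:
    """VLM-driven agent: maps a video clip to a set of degradation labels."""
    def identify(self, video) -> List[str]:
        # In the paper this is a vision-language model trained on a
        # degradation-recognition benchmark; here we return a fixed example.
        return ["noise", "compression_artifacts"]


class SelfAdaptiveRouter:
    """LLM-powered agent: picks an ordered sequence of restoration tools."""
    def __init__(self, tools: Dict[str, Callable]):
        self.tools = tools
        self.usage_log: List[RestorationStep] = []  # observed tool-usage patterns

    def route(self, degradations: List[str]) -> List[str]:
        # A real router would query an LLM conditioned on the degradations and
        # the usage log; this stub simply maps each degradation to one tool.
        mapping = {"noise": "denoise",
                   "compression_artifacts": "deblock",
                   "low_light": "low_light_enhance"}
        return [mapping[d] for d in degradations if d in mapping]


class RestorationQualityAssessor:
    """VLM-based video quality assessment agent (trained on Res-VQ in the paper)."""
    def score(self, video) -> float:
        return 0.0  # placeholder score


def moa_vr_pipeline(video, identifier, router, assessor):
    """Run identification -> routing/restoration -> quality assessment."""
    degradations = identifier.identify(video)
    for tool_name in router.route(degradations):
        video = router.tools[tool_name](video)
        router.usage_log.append(RestorationStep(tool_name, assessor.score(video)))
    return video, assessor.score(video)


# Example usage (identity functions stand in for real restoration models):
# router = SelfAdaptiveRouter({"denoise": lambda v: v, "deblock": lambda v: v})
# restored, final_score = moa_vr_pipeline(raw_video, DegradationIdentifier(),
#                                         router, RestorationQualityAssessor())
```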


URL

https://arxiv.org/abs/2510.08508

PDF

https://arxiv.org/pdf/2510.08508.pdf

