Paper Reading AI Learner

NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction

2024-10-25 10:28:26
Zixuan Gong, Guangyin Bao, Qi Zhang, Zhongwei Wan, Duoqian Miao, Shoujin Wang, Lei Zhu, Changwei Wang, Rongtao Xu, Liang Hu, Ke Liu, Yu Zhang

Abstract

Reconstruction of static visual stimuli from non-invasive brain activity (fMRI) has achieved great success, owing to advanced deep learning models such as CLIP and Stable Diffusion. However, research on fMRI-to-video reconstruction remains limited, since decoding the spatiotemporal perception of continuous visual experiences is formidably challenging. We contend that the key to addressing these challenges lies in accurately decoding both high-level semantics and low-level perception flows, as perceived by the brain in response to video stimuli. To this end, we propose NeuroClips, an innovative framework to decode high-fidelity and smooth video from fMRI. NeuroClips utilizes a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained T2V diffusion model injected with both keyframes and low-level perception flows for video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth high-fidelity video reconstruction of up to 6 s at 8 FPS, gaining significant improvements over state-of-the-art models in various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at this https URL.
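The abstract describes a two-stream design: a semantics reconstructor that decodes keyframes and a perception reconstructor that decodes low-level perception flows, both injected into a pre-trained T2V diffusion model at inference. Below is a minimal, hypothetical PyTorch sketch of such a two-stream decoder; the module names, voxel count, and output shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SemanticsReconstructor(nn.Module):
    """Maps an fMRI window to a CLIP-like embedding from which a keyframe is decoded."""
    def __init__(self, n_voxels: int, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(n_voxels, 2048), nn.GELU(),
            nn.Linear(2048, embed_dim))

    def forward(self, fmri: torch.Tensor) -> torch.Tensor:
        # (B, n_voxels) -> (B, embed_dim): one semantic embedding per clip
        return self.proj(fmri)

class PerceptionReconstructor(nn.Module):
    """Maps fMRI to a low-resolution 'perception flow': one coarse frame per
    output video frame, capturing low-level structure and motion."""
    def __init__(self, n_voxels: int, n_frames: int = 48, res: int = 32):
        super().__init__()
        self.n_frames, self.res = n_frames, res
        self.proj = nn.Linear(n_voxels, n_frames * 3 * res * res)

    def forward(self, fmri: torch.Tensor) -> torch.Tensor:
        b = fmri.shape[0]
        return self.proj(fmri).view(b, self.n_frames, 3, self.res, self.res)

# Toy usage: one fMRI sample yields a keyframe embedding plus a perception
# flow; a pre-trained T2V diffusion model would consume both as conditions.
fmri = torch.randn(1, 15000)                  # flattened voxel activity (voxel count is assumed)
sem = SemanticsReconstructor(15000)(fmri)     # guides semantic fidelity
flow = PerceptionReconstructor(15000)(fmri)   # guides smoothness; 6 s at 8 FPS = 48 frames
print(sem.shape, flow.shape)                  # (1, 768) and (1, 48, 3, 32, 32)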

Abstract (translated)

Reconstructing static visual stimuli from non-invasive brain activity (fMRI) has achieved great success, thanks to advanced deep learning models such as CLIP and Stable Diffusion. However, because decoding the spatiotemporal perception of continuous visual experiences is extremely challenging, fMRI-based video reconstruction research remains limited. We argue that the key to addressing these challenges lies in accurately decoding both the high-level semantics and the low-level perception flows that the brain perceives in response to video stimuli. To this end, we propose NeuroClips, an innovative framework for decoding high-fidelity and smooth video from fMRI. NeuroClips uses a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained T2V diffusion model that fuses keyframes and low-level perception flows to perform video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth, high-fidelity video reconstruction of up to 6 seconds at 8 FPS, with significant improvements over existing state-of-the-art models across various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at this https URL.
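For context on the reported 128% SSIM gain, a common way to score video reconstructions is to average frame-wise SSIM between the ground-truth clip and the reconstruction. The sketch below assumes (T, H, W, C) uint8 videos and uses scikit-image; the paper's exact evaluation protocol may differ.

import numpy as np
from skimage.metrics import structural_similarity as ssim

def video_ssim(gt: np.ndarray, rec: np.ndarray) -> float:
    """Mean SSIM over corresponding frames of two (T, H, W, C) uint8 videos."""
    scores = [ssim(g, r, channel_axis=-1, data_range=255)
              for g, r in zip(gt, rec)]
    return float(np.mean(scores))

# Toy example: 48 random frames (6 s at 8 FPS) at 64x64 resolution.
gt  = np.random.randint(0, 256, (48, 64, 64, 3), dtype=np.uint8)
rec = np.random.randint(0, 256, (48, 64, 64, 3), dtype=np.uint8)
print(video_ssim(gt, rec))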

URL

https://arxiv.org/abs/2410.19452

PDF

https://arxiv.org/pdf/2410.19452.pdf

