Paper Reading AI Learner

MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction

2024-04-19 05:12:04
Zixuan Gong, Qi Zhang, Guangyin Bao, Lei Zhu, Ke Liu, Liang Hu, Duoqian Miao

Abstract

Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but far less on cross-subject tasks. Reconstructing high-quality images in the cross-subject setting is challenging due to profound individual differences between subjects and the scarcity of annotated data. In this work, we propose MindTuner for cross-subject visual decoding, which achieves high-quality, semantically rich reconstructions from only 1 hour of fMRI training data by exploiting the visual-fingerprint phenomenon in the human visual system and a novel fMRI-to-text alignment paradigm. First, we pre-train a multi-subject model across 7 subjects and fine-tune it on scarce data from new subjects, using LoRAs with Skip-LoRAs to learn each subject's visual fingerprint. Then, we use the image modality as an intermediate pivot to align fMRI with text, which yields impressive fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. Both qualitative and quantitative analyses demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether trained on 1 hour or 40 hours of data.
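The abstract describes two technical ideas. The first is parameter-efficient per-subject adaptation: a decoder pre-trained across 7 subjects is frozen, and a new subject is fitted with low-rank adapters. Below is a minimal PyTorch sketch of that idea; the paper does not spell out Skip-LoRA here, so the low-rank skip path from voxels to the output embedding (`SkipLoRADecoder`, `skip_down`, `skip_up`) is an assumption for illustration, not the authors' exact formulation.

```python
# Minimal sketch: LoRA-style fine-tuning of a frozen, multi-subject
# pre-trained fMRI decoder for a new subject. The "Skip-LoRA" below is
# a hypothetical low-rank shortcut from the voxel input straight to the
# output embedding, meant to capture a subject-specific visual
# fingerprint; the paper's exact design may differ.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus scaled low-rank correction B @ A.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

class SkipLoRADecoder(nn.Module):
    """Pre-trained decoder wrapped with per-subject LoRAs and a skip path."""
    def __init__(self, backbone: nn.Sequential, in_dim: int, out_dim: int,
                 rank: int = 8):
        super().__init__()
        # Wrap every linear layer of the frozen backbone with a LoRA adapter.
        self.layers = nn.Sequential(*[
            LoRALinear(m, rank) if isinstance(m, nn.Linear) else m
            for m in backbone
        ])
        # Hypothetical Skip-LoRA: low-rank shortcut, voxels -> embedding.
        self.skip_down = nn.Linear(in_dim, rank, bias=False)
        self.skip_up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.skip_up.weight)    # skip starts at zero, so the
                                               # initial output is the backbone's

    def forward(self, voxels):
        return self.layers(voxels) + self.skip_up(self.skip_down(voxels))
```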
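The second idea is using images as a pivot between fMRI and text: if fMRI embeddings are contrastively aligned to CLIP image embeddings, and CLIP already aligns images with text in one shared space, then fMRI-to-text retrieval follows almost for free. The sketch below assumes that setup; the symmetric InfoNCE loss shape and names like `clip_style_loss` are illustrative, not the paper's implementation.

```python
# Hedged sketch of image-as-pivot alignment: fMRI embeddings are pulled
# toward paired CLIP image embeddings with a symmetric contrastive loss,
# after which captions can be ranked in the same CLIP space.
import torch
import torch.nn.functional as F

def clip_style_loss(fmri_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE between fMRI and image embeddings of one batch."""
    f = F.normalize(fmri_emb, dim=-1)
    i = F.normalize(image_emb, dim=-1)
    logits = f @ i.T / temperature                   # (B, B) similarities
    targets = torch.arange(len(f), device=f.device)  # matched pairs: diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

@torch.no_grad()
def fmri_to_text_retrieval(fmri_emb, text_embs):
    """Rank candidate CLIP text embeddings for each aligned fMRI embedding."""
    f = F.normalize(fmri_emb, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    return (f @ t.T).argsort(dim=-1, descending=True)  # indices, best first
```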

URL

https://arxiv.org/abs/2404.12630

PDF

https://arxiv.org/pdf/2404.12630.pdf

