Abstract
Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but far less on cross-subject tasks. Reconstructing high-quality images across subjects is challenging due to profound individual differences between subjects and the scarcity of annotated data. In this work, we propose MindTuner for cross-subject visual decoding, which achieves high-quality, semantically rich reconstructions from only 1 hour of fMRI training data by exploiting the phenomenon of visual fingerprints in the human visual system and a novel fMRI-to-text alignment paradigm. First, we pre-train a multi-subject model on 7 subjects and fine-tune it with scarce data on new subjects, using LoRAs with Skip-LoRAs to learn the visual fingerprint. Then, we take the image modality as an intermediate pivot to achieve fMRI-to-text alignment, which yields impressive fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. Both qualitative and quantitative analyses demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether trained on 1 hour or 40 hours of data.
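The abstract's fine-tuning setup can be pictured as low-rank adapters attached to a frozen pre-trained model. The following is a minimal NumPy sketch of that idea, not the paper's implementation: all names and shapes are hypothetical, and "Skip-LoRA" is rendered here as an assumed second low-rank path from the raw input that bypasses the frozen layer.

```python
import numpy as np

# Hypothetical sketch of LoRA-style per-subject adapters on a frozen
# shared layer. Shapes, init scales, and the Skip-LoRA structure are
# assumptions for illustration, not the paper's actual architecture.

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 32, 4

# Shared weight pre-trained on the 7-subject data; frozen during fine-tuning.
W_frozen = rng.standard_normal((d_out, d_in)) * 0.02

# LoRA factors: only these small matrices are trained for a new subject.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))  # conventional LoRA init: B = 0, so the adapter starts as a no-op

# Assumed "Skip-LoRA": a second low-rank path from the input that
# skips the frozen layer, capturing subject-specific structure.
A_skip = rng.standard_normal((rank, d_in)) * 0.01
B_skip = np.zeros((d_out, rank))

def forward(x: np.ndarray) -> np.ndarray:
    # frozen path + LoRA correction + skip-path correction
    return W_frozen @ x + B @ (A @ x) + B_skip @ (A_skip @ x)

x = rng.standard_normal(d_in)
y = forward(x)
print(y.shape)  # (32,)
```

With both B matrices zero-initialized, the adapted model initially reproduces the frozen multi-subject model exactly; fine-tuning on the new subject's scarce data then only has to learn the low-rank deviation, which is one common rationale for LoRA-style transfer.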
URL
https://arxiv.org/abs/2404.12630