Brain Captioning: Decoding human brain activity into images and text

Abstract

Every day, the human brain processes an immense volume of visual information, relying on intricate neural mechanisms to perceive and interpret these stimuli. Recent advances in functional magnetic resonance imaging (fMRI) have enabled scientists to extract visual information from patterns of human brain activity. In this study, we present a method for decoding brain activity into meaningful images and captions, focusing on brain captioning because it is more flexible than decoding brain activity into images. Our approach builds on recent image captioning models and incorporates an image reconstruction pipeline that uses latent diffusion models and depth estimation. We use the Natural Scenes Dataset, a comprehensive fMRI dataset collected from eight subjects who viewed images from the COCO dataset. We employ the Generative Image-to-text Transformer (GIT) as our captioning backbone and propose a new image reconstruction pipeline based on latent diffusion models. The method involves training regularized linear regression models that map brain activity to extracted features. Additionally, we incorporate depth maps from the ControlNet model to further guide the reconstruction process. We evaluate our methods using quantitative metrics for both generated captions and images. Our brain captioning approach outperforms existing methods, while our image reconstruction pipeline generates plausible images with improved spatial relationships. In conclusion, we demonstrate significant progress in brain decoding and show the potential of integrating vision and language to better understand human cognition. Our approach provides a flexible platform for future research, with potential applications in fields including neural art, style transfer, and portable devices.
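
The regression step mentioned in the abstract can be illustrated with a minimal sketch. The abstract says only that regularized linear regression is trained between brain activity and extracted features; the choice of ridge (L2) regularization, the closed-form solver, the value of `alpha`, the array sizes, and the synthetic data below are all assumptions made for illustration, not details taken from the paper. Real inputs would be fMRI voxel responses and features extracted by a captioning or diffusion model, neither of which is modeled here.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Fit ridge regression in closed form.

    Solves W = (Xc^T Xc + alpha * I)^-1 Xc^T Yc on mean-centered data and
    returns the weights together with the means needed for prediction.
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    return W, x_mean, y_mean


def predict(X, W, x_mean, y_mean):
    """Map brain activity (trials x voxels) to predicted feature vectors."""
    return (X - x_mean) @ W + y_mean


def columnwise_corr(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom


# Synthetic stand-in data (hypothetical sizes): 200 trials, 50 voxels,
# 16 feature dimensions, with a known linear relationship plus noise.
rng = np.random.default_rng(0)
n_trials, n_voxels, n_features = 200, 50, 16
true_W = rng.normal(size=(n_voxels, n_features))
X = rng.normal(size=(n_trials, n_voxels))
Y = X @ true_W + 0.1 * rng.normal(size=(n_trials, n_features))

# Hold out the last 50 trials to check generalization.
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]

W, x_mean, y_mean = fit_ridge(X_train, Y_train, alpha=1.0)
Y_pred = predict(X_test, W, x_mean, y_mean)

mean_corr = float(columnwise_corr(Y_pred, Y_test).mean())
print(f"mean held-out correlation: {mean_corr:.3f}")
```

In a real decoding setting the number of voxels typically far exceeds the number of trials, which is why some form of regularization is needed; the regularization strength would normally be chosen by cross-validation rather than fixed as it is in this sketch.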

URL

https://arxiv.org/abs/2305.11560

PDF

https://arxiv.org/pdf/2305.11560.pdf