Abstract
Inverse protein folding, a fundamental task in computational protein design, aims to design protein sequences that fold into desired backbone structures. Although machine learning algorithms for this task have seen significant success, the prevailing approaches predominantly employ a discriminative formulation; they frequently suffer from error accumulation and often fail to capture the wide variety of plausible sequences. To fill these gaps, we propose Bridge-IF, a generative diffusion bridge model for inverse folding that is designed to learn the probabilistic dependency between the distributions of backbone structures and protein sequences. Specifically, we use an expressive structure encoder to propose a discrete, informative prior derived from structures, and establish a Markov bridge to connect this prior with native sequences. At inference time, Bridge-IF progressively refines the prior sequence, culminating in a more plausible design. Moreover, we introduce a reparameterization perspective on Markov bridge models, from which we derive a simplified loss function that facilitates more effective training. We also modulate protein language models (PLMs) with structural conditions to precisely approximate the Markov bridge process, thereby significantly enhancing generation performance while maintaining parameter-efficient training. Extensive experiments on well-established benchmarks demonstrate that Bridge-IF outperforms existing baselines in sequence recovery in most cases and excels in the design of plausible proteins with high foldability. The code is available online.
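To make two notions from the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows (a) sequence recovery, computed here as the fraction of positions where a designed residue matches the native one, and (b) a toy intermediate state on a bridge between a prior sequence and a native sequence. The function names `sequence_recovery` and `sample_bridge_state`, the parameter `keep_prob`, and the example sequences are all hypothetical. The per-position "keep the prior residue or switch to the native residue" rule is an assumed simplification of a discrete Markov bridge; the abstract does not specify the actual transition rule.

```python
import random


def sequence_recovery(designed: str, native: str) -> float:
    """Return the fraction of positions where designed and native residues match."""
    if len(designed) != len(native):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def sample_bridge_state(
    prior: str, native: str, keep_prob: float, rng: random.Random
) -> str:
    """Sample a toy intermediate sequence between prior and native.

    ASSUMPTION (illustrative only): each position independently keeps the
    prior residue with probability keep_prob and otherwise takes the native
    residue. keep_prob = 1.0 returns the prior; keep_prob = 0.0 returns the
    native sequence.
    """
    if len(prior) != len(native):
        raise ValueError("sequences must have equal length")
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in [0, 1]")
    return "".join(
        p if rng.random() < keep_prob else n for p, n in zip(prior, native)
    )


if __name__ == "__main__":
    # Hypothetical sequences that differ at two of ten positions.
    prior_seq = "MKTAYIAKQR"
    native_seq = "MKTVYLAKQR"
    rng = random.Random(0)

    print("prior recovery:", sequence_recovery(prior_seq, native_seq))
    for keep_prob in (1.0, 0.5, 0.0):
        state = sample_bridge_state(prior_seq, native_seq, keep_prob, rng)
        print(keep_prob, state, sequence_recovery(state, native_seq))
```

In this toy setting, lowering `keep_prob` moves the sampled sequence from the prior toward the native sequence, which mirrors the progressive refinement described in the abstract. In the actual model the native sequence is unknown at inference time, so a learned network would have to stand in for it.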
URL
https://arxiv.org/abs/2411.02120