DiffSDS: A language diffusion model for protein backbone inpainting under geometric conditions and constraints

Abstract

Have you ever been troubled by the complexity and computational cost of SE(3) protein structure modeling, and been amazed by the simplicity and power of language modeling? Recent work has shown promise in simplifying protein structures into sequences of backbone angles, so that language models can be used for unconstrained protein backbone generation. Unfortunately, this simplification is ill-suited to the constrained protein inpainting problem, where the model must recover masked structures conditioned on the unmasked ones, because it dramatically increases the cost of computing geometric constraints. To overcome this dilemma, we propose inserting a hidden atomic direction space (ADS) on top of the language model: a Seq2Direct encoder ($\text{Enc}_{s2d}$) converts invariant backbone angles into equivalent direction vectors while preserving the simplicity of the language model. Geometric constraints can then be imposed efficiently in the newly introduced direction space. A Direct2Seq decoder ($\text{Dec}_{d2s}$) with mathematical guarantees is also introduced, and together the two form the SDS model ($\text{Enc}_{s2d}$ + $\text{Dec}_{d2s}$). We use the SDS model as the denoising neural network in a conditional diffusion process, yielding a constrained generative model, DiffSDS. Extensive experiments show that the plug-and-play ADS can turn a language model into a strong structural model without loss of simplicity. More importantly, DiffSDS outperforms previous strong baselines by a large margin on the protein inpainting task.
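To make the angle-to-direction idea concrete, here is a minimal Python sketch. It is not the paper's Seq2Direct encoder or Direct2Seq decoder. It assumes a fixed global frame, a plain spherical-coordinate mapping, and a constant bond length, and every function name in it is hypothetical. It only illustrates two points from the abstract: a pair of angles and a unit direction vector can be converted back and forth, and a connectivity constraint of the kind inpainting needs (the generated segment must end at a given anchor) reduces to a simple vector sum once directions are available.

```python
import math

# Illustrative only: a global-frame spherical mapping, not the paper's encoder.

def angles_to_direction(theta, phi):
    """Map polar angle theta in (0, pi) and azimuth phi in (-pi, pi] to a unit vector."""
    return (
        math.sin(theta) * math.cos(phi),
        math.sin(theta) * math.sin(phi),
        math.cos(theta),
    )

def direction_to_angles(d):
    """Inverse map: recover (theta, phi) from a unit vector."""
    x, y, z = d
    theta = math.acos(max(-1.0, min(1.0, z)))  # clamp against rounding error
    phi = math.atan2(y, x)
    return theta, phi

def end_to_end(directions, bond_length=1.0):
    """Displacement spanned by a chain of unit directions (assumed constant bond length)."""
    return tuple(bond_length * sum(d[k] for d in directions) for k in range(3))

def gap_violation(directions, start, end, bond_length=1.0):
    """Distance between where the chain ends and where the anchor requires it to end."""
    disp = end_to_end(directions, bond_length)
    return math.sqrt(sum((start[k] + disp[k] - end[k]) ** 2 for k in range(3)))
```

In this toy setting the constraint is linear in the direction vectors, whereas expressing the same condition directly on a sequence of angles would require chaining rotations along the whole segment. Real backbone angles are defined in local reference frames, which this sketch deliberately ignores.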

URL

https://arxiv.org/abs/2301.09642

PDF

https://arxiv.org/pdf/2301.09642.pdf