Abstract
Protein representation learning has benefited greatly from the remarkable development of language models (LMs). Accordingly, pre-trained protein models also inherit a known weakness of LMs: a lack of factual knowledge. A recent solution models the relationships between a protein and its associated knowledge terms as the knowledge encoding objective. However, it fails to explore these relationships at a more granular level, i.e., the token level. To mitigate this, we propose the Knowledge-exploited Auto-encoder for Protein (KeAP), which performs token-level knowledge graph exploration for protein representation learning. In practice, non-masked amino acids iteratively query the associated knowledge tokens via attention to extract and integrate information helpful for restoring masked amino acids. We show that KeAP consistently outperforms its previous counterpart on 9 representative downstream applications, sometimes by large margins. These results suggest that KeAP provides an alternative yet effective way to perform knowledge-enhanced protein representation learning.
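The token-level querying described above can be pictured as cross-attention: amino-acid token representations act as queries over the token embeddings of associated knowledge text. The following is a minimal, hedged sketch of that idea, not the authors' implementation; the module layout, dimensions, and names (`KnowledgeCrossAttention`, `dim=64`) are illustrative assumptions.

```python
# Illustrative sketch (NOT the KeAP codebase): protein tokens query
# knowledge-text tokens via cross-attention, then integrate the attended
# information with a residual connection. All sizes are toy values.
import torch
import torch.nn as nn

class KnowledgeCrossAttention(nn.Module):
    """Amino-acid tokens (queries) attend over knowledge tokens (keys/values)."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, protein_tokens: torch.Tensor,
                knowledge_tokens: torch.Tensor) -> torch.Tensor:
        # Each amino-acid representation extracts information from the
        # knowledge tokens; a residual + LayerNorm integrates it.
        attended, _ = self.attn(protein_tokens, knowledge_tokens, knowledge_tokens)
        return self.norm(protein_tokens + attended)

# Toy shapes: batch of 2 proteins, 10 amino acids, 32 knowledge tokens, dim 64.
block = KnowledgeCrossAttention()
protein = torch.randn(2, 10, 64)
knowledge = torch.randn(2, 32, 64)
out = block(protein, knowledge)
print(tuple(out.shape))  # (2, 10, 64)
```

In a masked-modeling setup, such a block would typically be stacked several times so non-masked positions can iteratively refine what they extract from the knowledge tokens before the masked amino acids are predicted.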
URL
https://arxiv.org/abs/2301.13154