Abstract
We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are better (e.g., more stable) than existing sequences. By definition, the target sequences and their attribute values are therefore outside the training distribution, which poses challenges for existing methods that aim to generate the target sequence directly. Instead, we propose Iterative Controlled Extrapolation (ICE), which iteratively makes local edits to a sequence to enable extrapolation. We train the model on synthetically generated sequence pairs that demonstrate small improvements in the attribute value. Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. Our code and models are publicly available.
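The idea of extrapolating through repeated local edits can be illustrated with a toy sketch. It is not the authors' implementation. The attribute (`toy_score`, a count of the letter "A"), the two-letter alphabet, and the rule-based `toy_editor` standing in for a trained edit model are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, List, Tuple


def toy_score(seq: str) -> int:
    """Hypothetical attribute: number of 'A' characters (a stand-in for, e.g., stability)."""
    return seq.count("A")


def single_edits(seq: str, alphabet: str = "AB") -> List[str]:
    """All sequences that differ from `seq` at exactly one position (local edits)."""
    edits = []
    for i, ch in enumerate(seq):
        for a in alphabet:
            if a != ch:
                edits.append(seq[:i] + a + seq[i + 1:])
    return edits


def make_training_pairs(
    seqs: List[str], score: Callable[[str], int], max_gain: int = 1
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs where the target is a local edit
    that improves the attribute by a small margin (at most `max_gain`)."""
    pairs = []
    for s in seqs:
        for t in single_edits(s):
            gain = score(t) - score(s)
            if 0 < gain <= max_gain:
                pairs.append((s, t))
    return pairs


def toy_editor(seq: str) -> str:
    """Assumed stand-in for a trained edit model: return the first local edit
    that improves the toy attribute, or the input unchanged if none exists."""
    for t in single_edits(seq):
        if toy_score(t) > toy_score(seq):
            return t
    return seq


def iterative_extrapolate(
    seq: str, editor: Callable[[str], str], steps: int
) -> List[str]:
    """Apply the editor repeatedly; stop early once it no longer changes the sequence."""
    trajectory = [seq]
    for _ in range(steps):
        nxt = editor(trajectory[-1])
        if nxt == trajectory[-1]:
            break
        trajectory.append(nxt)
    return trajectory


# Training data only contains scores up to 2 ...
pairs = make_training_pairs(["BBBB", "ABBB"], toy_score)
# ... while repeated small edits reach a score of 4.
trajectory = iterative_extrapolate("BBBB", toy_editor, steps=10)
```

In this toy setting, each pair shows an improvement of one unit and no training target scores above 2, yet chaining the edits reaches a score of 4. This is the sense in which small, in-distribution improvements can compose into an out-of-range result.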
URL
https://arxiv.org/abs/2303.04562