Abstract
Recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to optimize laboratory operations and fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose a multi-domain, multi-task language model to solve a wide range of tasks in both the chemical and natural language domains. By leveraging multi-task learning, our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.
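The multi-task setup described above can be pictured as routing every task through one sequence-to-sequence model by prepending a natural-language task prefix to the input. The sketch below illustrates that idea; the prefix strings and helper function are illustrative assumptions, not the exact prompt formats used by the paper.

```python
# Sketch of multi-task prompting for a single text+chemistry seq2seq model.
# The task names and prefix wordings here are hypothetical examples, not the
# paper's actual prompt templates.

TASK_PREFIXES = {
    "forward_reaction": "Predict the product of the following reaction: ",
    "retrosynthesis": "Predict the reactants that produce the following product: ",
    "captioning": "Caption the following molecule: ",
    "text2mol": "Write in SMILES the molecule described by: ",
}

def build_prompt(task: str, payload: str) -> str:
    """Prepend a task prefix so one shared model can serve all tasks,
    whether the payload is natural language or a SMILES string."""
    if task not in TASK_PREFIXES:
        raise KeyError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + payload

# Example: asking the shared model to describe aspirin (given as SMILES).
prompt = build_prompt("captioning", "CC(=O)Oc1ccccc1C(=O)O")
```

Because every task is expressed in this uniform text-to-text format, no task-specific heads or fine-tuned checkpoints are needed: the prefix alone tells the shared model which domain mapping to perform.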
URL
https://arxiv.org/abs/2301.12586