Abstract
Phylogenetic trees elucidate evolutionary relationships among species, but phylogenetic inference remains challenging because it must jointly handle continuous parameters (branch lengths) and discrete parameters (tree topology). Traditional Markov chain Monte Carlo methods suffer from slow convergence and a heavy computational burden. Existing variational inference methods require pre-generated topologies and typically treat tree structures and branch lengths independently, so they may overlook critical sequence features, which limits their accuracy and flexibility. We propose PhyloGen, a novel method that leverages a pre-trained genomic language model to generate and optimize phylogenetic trees without depending on evolutionary models or aligned-sequence constraints. PhyloGen treats phylogenetic inference as a conditionally constrained tree structure generation problem and jointly optimizes tree topology and branch lengths through three core modules: (i) Feature Extraction, (ii) PhyloTree Construction, and (iii) PhyloTree Structure Modeling. We also introduce a Scoring Function that guides the model toward more stable gradient descent. We demonstrate the effectiveness and robustness of PhyloGen on eight real-world benchmark datasets, and visualization results confirm that PhyloGen provides deeper insights into phylogenetic relationships.
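To give a sense of why mixing discrete and continuous parameters is hard, the short sketch below counts the size of each part of the search space. It assumes unrooted, fully bifurcating trees, for which the number of distinct topologies on n taxa is (2n - 5)!! and the number of branch lengths is 2n - 3. The function names are illustrative and are not taken from PhyloGen.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted bifurcating tree topologies on n_taxa labeled leaves.

    Uses the standard double-factorial formula (2n - 5)!! = 3 * 5 * ... * (2n - 5).
    Assumption: trees are unrooted and fully bifurcating.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted bifurcating tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n_taxa == 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def num_branch_lengths(n_taxa: int) -> int:
    """Number of branches (continuous parameters) in such a tree: 2n - 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted bifurcating tree")
    return 2 * n_taxa - 3


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_topologies(n), num_branch_lengths(n))
```

The discrete part grows super-exponentially with the number of taxa while the continuous part grows only linearly, which is one reason methods that enumerate or pre-generate candidate topologies scale poorly.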
URL
https://arxiv.org/abs/2412.18827