Abstract
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: this https URL.
Abstract (translated)
形式验证(FV)在当前新兴的程序合成技术中见证了日益增长的重要性,这些技术正在演变大型语言模型(LLMs)。然而,目前的正式验证主要依赖于符号验证器或人工规则,导致对于广泛和灵活验证的限制。另一方面,像Isabelle这样的自动定理证明形式语言,作为另一条严谨的验证路线,使用全面的规则和定理。在本文中,我们提出了FVEL,一个带有LLM的交互式形式验证环境。具体来说,FVEL将给定的代码转换为Isabelle,然后通过LLM进行定理证明。结合范式利用Isabelle中严谨而丰富的规则和组织,同时也便于引入和调整最新的LLM。为实现这一目标,我们提取了一个大规模FVELER3。FVELER数据集包括在Isabelle中表述的代码依赖关系和验证过程,包括758个定理、29,125个引理和200,646个证明步骤,以及深入的依赖关系。我们在FVEL环境中通过首先用FVELER微调LLM,然后对它们在Code2Inv和SV-COMP上进行评估,来对比FVEL和LLM的性能。结果显示,带有FVELER微调的Llama3-8B解决了17.39%(69 -> 81)的问题,Mistral-7B则解决了12%(75 -> 84)的问题,证明错误的比例降低了。项目页面:此链接。
URL
https://arxiv.org/abs/2406.14408