Paper Reading AI Learner

Automated Creation of Source Code Variants of a Cryptographic Hash Function Implementation Using Generative Pre-Trained Transformer Models

2024-04-24 06:29:55
Elijah Pelofske, Vincent Urias, Lorie M. Liebrock

Abstract

Generative pre-trained transformers (GPTs) are a class of large language machine learning models that are unusually adept at producing novel, coherent natural language. This study examines the ability of GPT models to generate novel and correct, and notably very insecure, versions of implementations of the cryptographic hash function SHA-1. The GPT models Llama-2-70b-chat-h, Mistral-7B-Instruct-v0.1, and zephyr-7b-alpha are used. The models are prompted to re-write each function using a modified version of the localGPT framework, with langchain providing word-embedding context from the full source code and header files. This produced over 130,000 function re-write GPT output text blocks, of which approximately 40,000 could be parsed as C code and subsequently compiled. The generated code is analyzed for compilability, algorithmic correctness, memory leaks, compiler optimization stability, and character distance to the reference implementation. Remarkably, several generated function variants carry a high implementation security risk: they are correct for some test vectors but incorrect for others. Additionally, many function implementations did not correctly implement the reference SHA-1 algorithm, yet produced hashes exhibiting some basic characteristics of hash functions. Many of the function re-writes contained serious flaws such as memory leaks, integer overflows, out-of-bounds accesses, use of uninitialized values, and compiler optimization instability. Compiler optimization settings and SHA-256 checksums of the compiled binaries are used to cluster implementations that are functionally equivalent but may not have identical syntax; using this clustering, over 100,000 novel and correct versions of the SHA-1 codebase were generated in which each component C function differs from the original reference implementation.
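The two checks at the heart of the analysis, verifying a candidate SHA-1 implementation against known test vectors and clustering compiled binaries by checksum, can be sketched as follows. This is a minimal illustration using Python's hashlib, not the authors' actual harness; the function names and the small vector set are hypothetical.

```python
import hashlib

# Standard SHA-1 test vectors (message -> expected hex digest).
# A variant that matches some vectors but not others is exactly the
# "partially correct" security risk the paper highlights.
TEST_VECTORS = {
    b"": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
    b"abc": "a9993e364706816aba3e25717850c26c9cd0d89d",
}

def reference_sha1(message: bytes) -> str:
    """Known-good SHA-1 to compare candidate implementations against."""
    return hashlib.sha1(message).hexdigest()

def is_correct_sha1(candidate) -> bool:
    """True iff the candidate agrees with SHA-1 on every test vector."""
    return all(candidate(msg) == digest for msg, digest in TEST_VECTORS.items())

def cluster_by_checksum(binaries: dict) -> dict:
    """Group compiled binaries (name -> bytes) by their SHA-256 checksum.

    Variants whose binaries hash identically under the same compiler
    optimization settings are treated as equivalent implementations,
    even when their source syntax differs.
    """
    clusters: dict = {}
    for name, blob in binaries.items():
        clusters.setdefault(hashlib.sha256(blob).hexdigest(), []).append(name)
    return clusters
```

A correct re-write passes `is_correct_sha1`, while clustering collapses syntactically distinct but behaviorally identical variants into one equivalence class per checksum.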


URL

https://arxiv.org/abs/2404.15681

PDF

https://arxiv.org/pdf/2404.15681.pdf

