Abstract
Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation on quantitative finance tasks remains fragmented and largely limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them with financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate a range of state-of-the-art open-source and proprietary LLMs and observe substantial gaps relative to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data and observe consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
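To make the backtesting setup concrete, the following is a minimal sketch of what a deterministic CTA-style backtest with a cost model and financial performance metrics can look like. The strategy (a long/flat moving-average crossover), the function name backtest, and the parameters fast, slow, and cost_bp are illustrative assumptions for exposition; they are not the benchmark's actual asset universe, cost model, or metric definitions.

import numpy as np

def backtest(prices, fast=10, slow=30, cost_bp=5.0, periods_per_year=252):
    """Deterministic long/flat moving-average crossover backtest.

    prices: 1-D array of daily closes for a single asset.
    cost_bp: transaction cost in basis points, charged on turnover.
    Returns the annualized Sharpe ratio and maximum drawdown.
    """
    prices = np.asarray(prices, dtype=float)
    fast_ma = np.convolve(prices, np.ones(fast) / fast, mode="valid")
    slow_ma = np.convolve(prices, np.ones(slow) / slow, mode="valid")
    # Align both moving averages on the same dates.
    fast_ma = fast_ma[slow - fast:]
    # Long when the fast MA is above the slow MA, flat otherwise;
    # lag the signal by one bar so today's signal earns tomorrow's return.
    signal = (fast_ma > slow_ma).astype(float)
    rets = np.diff(prices[slow - 1:]) / prices[slow - 1:-1]
    pos = signal[:-1]
    turnover = np.abs(np.diff(np.concatenate([[0.0], pos])))
    strat_rets = pos * rets - turnover * cost_bp / 1e4
    sharpe = np.sqrt(periods_per_year) * strat_rets.mean() / (strat_rets.std() + 1e-12)
    equity = np.cumprod(1.0 + strat_rets)
    max_dd = np.max(1.0 - equity / np.maximum.accumulate(equity))
    return sharpe, max_dd

if __name__ == "__main__":
    rng = np.random.default_rng(0)  # fixed seed keeps the run deterministic
    prices = 100.0 * np.cumprod(1.0 + rng.normal(2e-4, 1e-2, 2000))
    sharpe, max_dd = backtest(prices)
    print(f"Sharpe: {sharpe:.2f}, max drawdown: {max_dd:.2%}")

A benchmark harness in this style would call such a function on model-generated strategy code over a fixed asset universe and seed, so that identical submissions always produce identical Sharpe and drawdown scores.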
URL
https://arxiv.org/abs/2601.08689