Abstract
Extensive research exists on the performance of large language models on logic-based tasks, whereas relatively little has been done on their ability to generate creative solutions to lateral thinking tasks. The BrainTeaser shared task tests lateral thinking and uses adversarial datasets to prevent memorization, resulting in poor performance for out-of-the-box models. We propose a system for iterative, chain-of-thought prompt engineering that optimizes prompts using human evaluation. Using this shared task, we demonstrate our system's ability to significantly improve model performance by optimizing prompts, and we evaluate the input dataset.
URL
https://arxiv.org/abs/2405.02517