Abstract
Recent advances in prompt engineering enable large language models (LLMs) to solve multi-hop logical reasoning problems with impressive accuracy. However, little existing work has investigated the robustness of LLMs under few-shot prompting techniques. Therefore, we introduce a systematic approach to test the robustness of LLMs in multi-hop reasoning tasks via domain-agnostic perturbations. We include perturbations at multiple levels of abstraction (e.g., lexical perturbations such as typos, and semantic perturbations such as the inclusion of intermediate reasoning steps in the questions) to conduct behavioral analysis of the LLMs. Across our experiments, we find that models are more sensitive to certain perturbations, such as replacing words with their synonyms. We also demonstrate that increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.
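To make the approach concrete, below is a minimal sketch of how domain-agnostic perturbations and perturbed-exemplar mixing might be implemented. The function names (`typo_perturb`, `synonym_perturb`, `build_prompt`), the adjacent-character-swap typo model, the toy synonym map, and the `perturb_fraction` parameter are illustrative assumptions for exposition only, not the paper's actual code or experimental setup.

```python
import random

# Hypothetical sketch (not the paper's implementation): two lexical-level,
# domain-agnostic perturbations plus a helper that perturbs a chosen
# fraction of few-shot exemplars before appending the test question.

def typo_perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Lexical perturbation: swap two adjacent characters in random words (typos)."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def synonym_perturb(text: str, synonyms: dict[str, str]) -> str:
    """Lexical perturbation: naive whitespace-tokenized synonym replacement."""
    return " ".join(synonyms.get(w.lower(), w) for w in text.split())

def build_prompt(exemplars: list[str], question: str,
                 perturb_fraction: float = 0.5, seed: int = 0) -> str:
    """Perturb a fixed fraction of the few-shot exemplars, then append the query."""
    rng = random.Random(seed)
    k = int(len(exemplars) * perturb_fraction)
    chosen = set(rng.sample(range(len(exemplars)), k))
    lines = [typo_perturb(ex, rate=0.3, seed=seed) if i in chosen else ex
             for i, ex in enumerate(exemplars)]
    return "\n\n".join(lines + [f"Q: {question}\nA:"])

if __name__ == "__main__":
    exemplars = [
        "Q: Tom has 3 apples and buys 2 more. How many apples does he have?\nA: 3 + 2 = 5. The answer is 5.",
        "Q: A train travels 60 miles in 2 hours. What is its speed?\nA: 60 / 2 = 30. The answer is 30 mph.",
    ]
    question = "Sara reads 4 pages a day for 7 days. How many pages does she read?"
    perturbed_q = synonym_perturb(question, {"reads": "peruses", "pages": "sheets"})
    print(build_prompt(exemplars, perturbed_q, perturb_fraction=0.5))
```

Robustness would then be measured by comparing model accuracy on clean versus perturbed prompts, and by varying `perturb_fraction` to study how the share of perturbed exemplars affects performance.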
Abstract (translated)
Recent advances in prompt engineering have enabled large language models (LLMs) to solve multi-hop logical reasoning problems with greatly improved accuracy. However, little existing work has investigated the robustness of LLMs under few-shot prompting methods. Therefore, we introduce a systematic approach to test the robustness of LLMs in multi-hop reasoning tasks via domain-agnostic perturbations. We include perturbations at multiple levels of abstraction (e.g., lexical perturbations such as typos, and semantic perturbations such as the inclusion of intermediate reasoning steps in the questions) to conduct behavioral analysis of the LLMs. In our experiments, we find that models are more sensitive to certain perturbations, such as replacing words with their synonyms. We also demonstrate that increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.
URL
https://arxiv.org/abs/2311.00258