Abstract
Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary, large-scale LLMs that users may not be able to access, or may be prohibited from using with sensitive data. In this work, we study open-source LLMs at accessible scales ($\leq$14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will provide practitioners with reliable alternatives for synthetic data generation and offer insights for maximizing fine-tuning results in domain-specific applications.
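The Promptagator-style generation the abstract describes can be sketched as follows: a few in-domain (document, query) examples are concatenated with a new document, and the LLM is asked to produce a matching query; the resulting synthetic pairs then fine-tune a dense retriever. This is a minimal illustrative sketch of the prompt-construction step only; the example pairs and template wording are hypothetical, not taken from the paper.

```python
# Sketch of a Promptagator-style few-shot prompt builder (illustrative;
# the example (document, query) pairs and template are assumptions, not
# the paper's actual prompt).

FEW_SHOT_EXAMPLES = [
    ("Aspirin irreversibly inhibits cyclooxygenase enzymes, reducing "
     "prostaglandin synthesis.",
     "how does aspirin reduce inflammation"),
    ("The mitochondrion generates most of the cell's supply of ATP "
     "through oxidative phosphorylation.",
     "which organelle produces ATP"),
]

def build_prompt(document: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Assemble a few-shot prompt asking the model to write a query a
    searcher might issue to find the given document."""
    parts = []
    for doc, query in examples:
        parts.append(f"Document: {doc}\nQuery: {query}\n")
    # The new document ends with an open "Query:" slot for the LLM to fill.
    parts.append(f"Document: {document}\nQuery:")
    return "\n".join(parts)

prompt = build_prompt(
    "Dense retrievers encode queries and passages into a shared vector space."
)
```

In the full pipeline, `prompt` would be sent to an open-source LLM (e.g. via a local inference server), and the completion paired with the document as one synthetic training example.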
URL
https://arxiv.org/abs/2510.02241