Abstract
Under the paradigm of AI-Generated Bidding (AIGB), maximizing an advertiser's cumulative value from winning impressions under budget constraints is a complex challenge in online advertising. Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. Large Language Models (LLMs) offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data; however, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that enhances both reasoning and numerical precision by dynamically updating the reference policy during training. Built on this foundation, we further propose DARA, a novel dual-phase framework that decomposes decision-making into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans using feedback-driven reasoning. This separation allows DARA to combine the in-context learning strengths of LLMs with the precise adaptability required by AIGB tasks. Extensive experiments in both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in cumulative advertiser value under budget constraints.
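The abstract does not specify how the reference policy is updated. As a minimal sketch of the general idea, the snippet below shows standard GRPO-style group-relative advantages together with one plausible "adaptive reference" rule, an exponential moving average of the reference parameters toward the current policy. Both function names and the EMA rule are illustrative assumptions, not the paper's actual algorithm.

```python
import math

def grpo_advantages(rewards):
    # GRPO-style advantages: normalize each sampled completion's reward
    # relative to its group's mean and standard deviation.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

def update_reference(ref_params, policy_params, tau=0.05):
    # Hypothetical "adaptive reference" step: move the reference policy's
    # parameters a fraction tau toward the current policy each update,
    # so the KL anchor tracks training progress instead of staying fixed.
    return [(1 - tau) * r + tau * p for r, p in zip(ref_params, policy_params)]
```

In this reading, a fixed reference policy (as in vanilla GRPO-style post-training) would correspond to `tau = 0`; the adaptive variant keeps the KL constraint centered near the improving policy.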
URL
https://arxiv.org/abs/2601.14711