Abstract
Query-product relevance analysis is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of large language models (LLMs), particularly their chain-of-thought (CoT) reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: supervised fine-tuning (SFT) and direct preference optimization (DPO) suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verifiable rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for Taobao Search Relevance (TaoSR-SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. TaoSR-SHE further incorporates two key techniques: diversified data filtering to encourage exploration across varied reasoning paths and mitigate policy entropy collapse, and multi-stage curriculum learning to foster progressive capability growth. Extensive experiments on real-world search benchmarks show that TaoSR-SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.
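The abstract does not give implementation details, but the hybrid step-level reward idea can be illustrated with a minimal sketch in Python. Everything below is a hypothetical illustration rather than the paper's method: the class and function names, the [-1, 1] reward scale, and the "critical step" weighting heuristic are assumptions introduced only to make the idea concrete.

# Minimal, assumption-heavy sketch of the hybrid step-level reward described in
# the abstract: each reasoning step is scored by a generative stepwise reward
# model and, when available, overridden by a human-annotated offline verifier.
# All names and numeric choices here are hypothetical, not from the paper.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepJudgement:
    rm_score: float                  # [0, 1] score from the stepwise reward model
    verifier_label: Optional[bool]   # human-annotated verdict, if covered offline


def hybrid_step_reward(step: StepJudgement,
                       critical_weight: float = 2.0,
                       critical_margin: float = 0.3) -> float:
    """Combine reward-model and verifier signals into one step-level reward.

    The offline verifier, when present, is treated as ground truth; otherwise
    the reward-model score is used. Steps with a strong (far-from-neutral)
    signal are up-weighted, loosely mirroring the abstract's emphasis on
    learning from critical correct and incorrect reasoning steps.
    """
    if step.verifier_label is not None:
        base = 1.0 if step.verifier_label else -1.0
    else:
        base = 2.0 * step.rm_score - 1.0  # map [0, 1] -> [-1, 1]

    weight = critical_weight if abs(base) > critical_margin else 1.0
    return weight * base


def trajectory_rewards(steps: List[StepJudgement]) -> List[float]:
    """Per-step rewards for one sampled reasoning trajectory."""
    return [hybrid_step_reward(s) for s in steps]

Under this sketch, a policy-gradient update such as SRPO would consume the per-step rewards instead of a single trajectory-level verification score, which is the contrast with sparse-reward RLVR that the abstract draws.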
URL
https://arxiv.org/abs/2510.07972