TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

2025-10-09 09:03:15
Pengkun Jiao, Yiming Jin, Jianhui Yang, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang

Abstract

Query-product relevance analysis is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of large language models (LLMs), particularly their chain-of-thought (CoT) reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: SFT and DPO suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verification rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for Taobao Search Relevance (TaoSR-SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. TaoSR-SHE further incorporates two key techniques: diversified data filtering to encourage exploration across varied reasoning paths and mitigate policy entropy collapse, and multi-stage curriculum learning to foster progressive capability growth. Extensive experiments on real-world search benchmarks show that TaoSR-SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.
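The contrast the abstract draws between sparse outcome rewards and step-level rewards can be illustrated with a small sketch. This is not the paper's SRPO objective, which the abstract does not specify. The function names (`outcome_advantages`, `hybrid_step_reward`, `stepwise_advantages`), the rule that a verifier label overrides the reward-model score, the ±1 reward values, and the pooled normalization are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def outcome_advantages(final_rewards):
    """Group-normalized advantages from one scalar reward per sampled response
    (the GRPO-style outcome-level signal)."""
    mu = mean(final_rewards)
    sigma = pstdev(final_rewards)
    if sigma == 0:
        # Every response got the same reward: no learning signal at all.
        return [0.0 for _ in final_rewards]
    return [(r - mu) / sigma for r in final_rewards]


def hybrid_step_reward(model_score, verifier_label):
    """Hypothetical hybrid rule: trust the offline human-annotated verifier
    when it has a label for this step, otherwise fall back to the generative
    reward model's score."""
    if verifier_label is not None:
        return 1.0 if verifier_label else -1.0
    return model_score


def stepwise_advantages(step_rewards_per_response):
    """Normalize every step reward against all step rewards pooled over the
    sampled group (an assumed normalization, chosen for simplicity)."""
    pooled = [r for steps in step_rewards_per_response for r in steps]
    mu = mean(pooled)
    sigma = pstdev(pooled)
    if sigma == 0:
        return [[0.0 for _ in steps] for steps in step_rewards_per_response]
    return [[(r - mu) / sigma for r in steps] for steps in step_rewards_per_response]


if __name__ == "__main__":
    # Two sampled responses that both end with the correct relevance label,
    # but the second one contains a faulty intermediate reasoning step.
    print(outcome_advantages([1.0, 1.0]))
    print(stepwise_advantages([[1.0, 1.0, 1.0], [1.0, -1.0, 1.0]]))
```

In this toy group the outcome-level advantages are all zero because both responses reach the right answer, while the step-level version assigns a negative advantage only to the faulty intermediate step. That is the kind of signal the abstract says sparse verification rewards fail to provide.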

URL

https://arxiv.org/abs/2510.07972

PDF

https://arxiv.org/pdf/2510.07972.pdf

