Paper Reading AI Learner

Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

2018-05-27 13:51:48
Julia Kreutzer, Joshua Uyheng, Stefan Riezler

Abstract

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator $\alpha$-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to great potential for applications at larger scale.
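The core idea of the paper — maximizing expected reward under the model's output distribution, where the reward comes from an estimator trained on human ratings rather than a gold reference — can be illustrated with a minimal sketch. The snippet below is a toy, not the paper's implementation: it uses a softmax policy over a tiny candidate set and a fixed reward lookup standing in for the regression-based reward estimator, and applies a plain REINFORCE-style update (gradient of log-probability of the sampled output, scaled by its estimated reward).

```python
import math
import random

random.seed(0)

# Toy stand-ins (hypothetical): three candidate outputs and an
# "estimated reward" table playing the role of a reward estimator
# trained on human 5-point ratings.
candidates = ["a", "b", "c"]
est_reward = {"a": 0.2, "b": 0.9, "c": 0.4}

# Policy: softmax over one logit per candidate.
theta = {c: 0.0 for c in candidates}

def sample(theta):
    """Sample a candidate from the softmax policy; return it with the probs."""
    z = sum(math.exp(v) for v in theta.values())
    probs = {c: math.exp(v) / z for c, v in theta.items()}
    r, acc = random.random(), 0.0
    for c, p in probs.items():
        acc += p
        if r < acc:
            return c, probs
    return c, probs  # numerical edge case: return last candidate

lr = 0.5
for _ in range(2000):
    y, probs = sample(theta)
    reward = est_reward[y]  # bandit feedback: only the sampled output is scored
    # REINFORCE update: d/d theta_c of log p(y) is 1[c == y] - p(c)
    for c in candidates:
        theta[c] += lr * reward * ((1.0 if c == y else 0.0) - probs[c])

# The policy should concentrate on the highest-reward candidate "b".
print(max(theta, key=theta.get))
```

In expectation, each logit moves proportionally to p(c) times (r(c) minus the average reward), so only candidates with above-average estimated reward gain probability mass; this is the bandit setting the paper studies, since no reference translation is ever consulted.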

URL

https://arxiv.org/abs/1805.10627

PDF

https://arxiv.org/pdf/1805.10627.pdf
