Abstract
We present a study of reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, analyze the influence of that reliability on the learnability of a reward estimator, and assess the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator $\alpha$-agreement is comparable. The best reliability is obtained for standardized cardinal feedback, which is also the easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator, trained on cardinal feedback for 800 translations, into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale.
URL
https://arxiv.org/abs/1805.10627
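
The setup described in the abstract, where a learned reward estimator replaces direct human feedback inside the RL loop, can be illustrated with a minimal sketch. This is a hedged toy illustration, not the paper's actual NMT system: the policy here is just a softmax over a few fixed candidate translations, and `estimated_reward` is a hypothetical stand-in for the regression model trained on standardized cardinal feedback.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": softmax over 3 candidate translations, parameterized by logits.
logits = np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical stand-in for the learned reward estimator r-hat: a fixed table
# simulating regression outputs on a normalized 5-point cardinal scale.
estimated_reward = np.array([0.2, 0.9, 0.4])

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)      # sample one translation from the policy
    r = estimated_reward[a]         # query the estimator (no human in the loop)
    grad_logp = -probs
    grad_logp[a] += 1.0             # gradient of log p(a) for a softmax policy
    logits += lr * r * grad_logp    # REINFORCE-style expected-reward update

# The policy should have shifted probability mass toward the candidate
# with the highest estimated reward.
print(softmax(logits))
```

The key design point this sketch mirrors is that the estimator is trained once on a small batch of human ratings (800 translations in the paper) and then queried arbitrarily often during RL, which is what makes learning from small amounts of human feedback feasible.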