Abstract
Reinforcement learning (RL), which has been successfully applied to sequence prediction, introduces reward as a sequence-level supervision signal
to evaluate the quality of a generated sequence.
Existing RL approaches use the ground-truth sequence to define reward, which limits the application of RL techniques to labeled data. Since labeled
data is usually scarce and/or costly to collect, it is
desirable to leverage large-scale unlabeled data. In
this paper, we extend existing RL methods for sequence prediction to exploit unlabeled data. We
propose to learn the reward function from labeled
data and to use its predicted reward as a pseudo reward
for unlabeled data, so that the model can also be trained on unlabeled data. To obtain accurate
pseudo rewards on unlabeled data, we propose an
RNN-based reward network with an attention mechanism, trained with a purposely biased data distribution. Experiments show that the pseudo reward can
provide good supervision and guide the learning
process on unlabeled data. We observe significant
improvements on both neural machine translation
and text summarization.
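
The pseudo-reward idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the method of this paper: the toy copy task, the unigram-overlap reward standing in for a metric computed against the ground truth, and the two-feature linear model standing in for the RNN-based reward network with attention. The purposely biased training distribution is not modeled. The sketch only shows the data flow: fit a reward predictor on labeled pairs, then score candidates for a source that has no reference.

```python
import random

VOCAB = list("abcdefghij")  # toy source vocabulary (assumption)
NOISE = list("uvwxyz")      # tokens that never occur in a source (assumption)

def true_reward(candidate, reference):
    """Ground-truth reward: unigram-overlap F1, a toy stand-in for BLEU/ROUGE."""
    c, r = set(candidate), set(reference)
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def features(source, candidate):
    """Hypothetical features that need only the source, never the reference."""
    s, c = set(source), set(candidate)
    return [1.0, len(s & c) / len(s)]

def corrupt(source, rate, rng):
    """Replace each token with a noise token with probability `rate`."""
    return [rng.choice(NOISE) if rng.random() < rate else tok for tok in source]

def predict(w, source, candidate):
    """Pseudo reward: linear model over reference-free features."""
    return sum(wi * xi for wi, xi in zip(w, features(source, candidate)))

def fit_reward_model(labeled, rng, epochs=100, lr=0.1):
    """Regress the true reward on reference-free features with plain SGD."""
    examples = []
    for source, reference in labeled:
        for rate in (0.0, 0.25, 0.5, 0.75, 1.0):
            cand = corrupt(source, rate, rng)
            examples.append((source, cand, true_reward(cand, reference)))
    w = [0.0, 0.0]
    for _ in range(epochs):
        rng.shuffle(examples)
        for source, cand, target in examples:
            err = predict(w, source, cand) - target
            x = features(source, cand)
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

rng = random.Random(0)
# Labeled data for a toy copy task: the reference equals the source.
labeled = [(src, src) for src in (rng.sample(VOCAB, 6) for _ in range(50))]
w = fit_reward_model(labeled, rng)

# Unlabeled source: no reference exists, so only the pseudo reward is available.
unlabeled_source = rng.sample(VOCAB, 6)
good = predict(w, unlabeled_source, list(unlabeled_source))
bad = predict(w, unlabeled_source, rng.sample(NOISE, 6))
```

In an RL training loop, a value such as `good` or `bad` would take the place of the ground-truth reward when the policy is updated on unlabeled sources.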