Abstract
This paper presents a new approach that extends Deep Dyna-Q (DDQ) by incorporating
Budget-Conscious Scheduling (BCS) to make the best use of
a fixed, small number of user interactions (the budget) for learning task-oriented dialogue agents. BCS consists of (1) a Poisson-based global scheduler to allocate the budget over
different stages of training; (2) a controller to
decide at each training step whether the agent
is trained using real or simulated experiences;
and (3) a user goal sampling module to generate
the experiences that are most effective for policy learning. Experiments on a movie-ticket
booking task with simulated and real users
show that our approach leads to significant improvements in success rate over the state-of-the-art baselines given the fixed budget.