Abstract
Dyna is an architecture for model-based reinforcement learning (RL), where simulated experience
from a model is used to update policies or value functions. A key component of Dyna is search-control,
the mechanism for generating the states and actions from
which the agent queries the model; this mechanism remains
largely unexplored. In this work, we propose to
generate such states by Hill Climbing (HC) on the
current estimate of the value function, using the states
along the resulting trajectory. This has the effect of propagating
value from high-value regions and of preemptively
updating value estimates of the regions that the agent
is likely to visit next. We derive a noisy projected
natural gradient algorithm for hill climbing, and
highlight a connection to Langevin dynamics. We
demonstrate empirically on four classical
domains that our algorithm, HC-Dyna, can obtain
significant improvements in sample efficiency. We
study the properties of different sampling distributions for search-control, and find that there appears
to be a benefit specifically from using the samples
generated by climbing on current value estimates
from low-value to high-value regions.
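
To make the search-control mechanism concrete, the sketch below generates states by noisy gradient ascent on a differentiable value estimate. It is a simplified illustration rather than the paper's algorithm: it uses a plain Euclidean gradient with clamping to a box-shaped state space in place of the noisy projected natural gradient described above, and the function and parameter names (value_fn, hill_climb_states, the step sizes) are assumptions for the example.

```python
# A minimal sketch, not the paper's exact method: hill climbing on a
# learned value estimate with Gaussian noise (Langevin-style), assuming
# a PyTorch value function and a box-shaped state space. The paper's
# noisy projected *natural* gradient is replaced here by a plain
# gradient step followed by clamping.
import torch

def hill_climb_states(value_fn, s0, n_steps=20, step_size=0.05,
                      noise_scale=0.01, low=-1.0, high=1.0):
    """Ascend value_fn from s0 and return the states visited.

    value_fn : maps a 1-D state tensor to a scalar value estimate V(s)
    s0       : starting state (1-D tensor), e.g. the agent's last state
    low/high : bounds used to project iterates back into the state space
    """
    s = s0.clone().detach().requires_grad_(True)
    trajectory = []
    for _ in range(n_steps):
        v = value_fn(s)
        (grad,) = torch.autograd.grad(v, s)  # dV/ds at the current state
        with torch.no_grad():
            # Ascent step plus Gaussian noise (the Langevin-style term),
            # then projection back into the valid state box.
            s = (s + step_size * grad
                 + noise_scale * torch.randn_like(s)).clamp(low, high)
        s = s.requires_grad_(True)
        trajectory.append(s.detach().clone())
    return trajectory
```

In a Dyna loop, the states returned by such a procedure would populate the search-control queue, and the model would be queried from them to produce simulated transitions for planning updates.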