Abstract
Effective options can make reinforcement learning
easier by enhancing an agent’s ability to both explore in a targeted manner and plan further into
the future. However, learning an appropriate model
of an option’s dynamics is hard, as it requires estimating a highly parameterized probability distribution.
This paper introduces and motivates the Expected-Length Model (ELM) for options, an alternate
model for transition dynamics. We prove ELM
is a (biased) estimator of the traditional Multi-Time Model (MTM), but provide a non-vacuous
bound on their deviation. We further prove that, in
stochastic shortest path problems, ELM induces a
value function that is sufficiently similar to the one
induced by MTM, and is thus capable of supporting near-optimal behavior. We explore the practical
utility of this option model experimentally, finding
consistent support for the thesis that ELM is a suitable replacement for MTM. In some cases, we find
ELM leads to more sample-efficient learning, especially when options are arranged in a hierarchy.
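
As a minimal sketch of the distinction (the notation $T$, $\tau$, and $\gamma$ is illustrative; the precise definitions of both models are given in the body of the paper): MTM, following Sutton, Precup, and Singh, discounts each possible option duration separately, requiring a joint distribution over terminal states and durations, whereas ELM applies a single discount at the option's expected duration:

\[
  T_{\mathrm{MTM}}(s' \mid s, o) \;=\; \sum_{k=1}^{\infty} \gamma^{k}\,
      \Pr(s_k = s',\, \tau = k \mid s_0 = s,\, o),
  \qquad
  T_{\mathrm{ELM}}(s' \mid s, o) \;=\; \gamma^{\mathbb{E}[\tau \mid s,\, o]}\,
      \Pr(s' \mid s, o),
\]

where $\tau$ denotes the option's duration and $\gamma$ the discount factor. Under this reading, ELM requires estimating only the expected duration and the terminal-state distribution, rather than the highly parameterized joint distribution over state-duration pairs that the abstract identifies as the difficulty with MTM.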