Safe and Sample-Efficient Reinforcement Learning Algorithms
for Factored Environments
Abstract
Reinforcement Learning (RL) deals with problems
that can be modeled as a Markov Decision Process
(MDP) where the transition function is unknown.
In situations where an arbitrary policy π is already
in execution and the experiences with the environment
have been recorded in a batch D, an RL algorithm
can use D to compute a new policy π′. However,
the policy computed by traditional RL algorithms
might perform worse than π. Our
goal is to develop safe RL algorithms, where the
agent has high confidence that the performance
of π′ is better than the performance of π given D.
To develop sample-efficient and safe RL algorithms,
we combine ideas from exploration strategies in RL
with a safe policy improvement method.