Abstract
In Reinforcement Learning (RL), state-of-the-art algorithms require a large number of samples per state-action pair to estimate the transition kernel p. In many problems, a good approximation of p is not needed. For instance, if from one state-action pair (s, a), one can only transit to states with the same value, learning p(·|s, a) accurately is irrelevant (only its support matters). This paper aims at capturing such behavior by defining a novel hardness measure for Markov Decision Processes (MDPs) based on what we call the distribution-norm. The distributionnorm w.r.t. a measure is defined on zero v-mean functions f by the standard variation of f with respect to . We first provide a concentration inequality for the dual of the distribution-norm. This allows us to replace the problem-free, loose || · ||1 concentration inequalities used in most previous analysis of RL algorithms, with a tighter problem-dependent hardness measure. We then show that several common RL benchmarks have low hardness when measured using the new norm. The distribution-norm captures finer properties than the number of states or the diameter and can be used to assess the difficulty of MDPs.