Abstract
One might believe that model-based algorithms of reinforcement learning can propagate the obtained experience more quickly, and are able to direct exploration better. As a consequence, fewer exploratory actions should be enough to learn a good policy. Strangely enough, current theoretical results for model-based algorithms do not support this claim: in a finite Markov decision process with N states, the best available bounds on the number of exploratory steps necessary are of order O(N² log N), in contrast to the O(N log N) bound available for the model-free, delayed Q-learning algorithm. In this paper we show that Mormax, a modified version of the Rmax algorithm, needs to make at most O(N log N) exploratory steps. This matches the lower bound up to logarithmic factors, as well as the upper bound of the state-of-the-art model-free algorithm, while our new bound improves the dependence on other problem parameters.

Introduction

In the reinforcement learning (RL) framework, an agent interacts with an unknown environment and tries to maximize its long-term profit. A standard way to measure the efficiency of the agent is its sample complexity, also called exploration complexity. Roughly, this quantity tells how many non-optimal (exploratory) steps the agent makes at most. The best understood and most studied case is when the environment is a finite Markov decision process (MDP) with the expected total discounted reward criterion. Since the work of Kearns & Singh (1998), many algorithms have been published with bounds on their sample complexity.
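To fix ideas, a minimal sketch of the mechanism behind Rmax-style algorithms may help: each state-action pair is "unknown" until it has been sampled m times; unknown pairs are valued optimistically at r_max / (1 - gamma), which is what drives the agent to take the exploratory steps that the bounds above count. The Python sketch below is ours and purely illustrative: it shows the generic known-pair bookkeeping, not the Mormax algorithm of this paper, and all class, method, and parameter names are assumptions.

import numpy as np

class RmaxSketch:
    """Illustrative sketch of the 'known state-action' bookkeeping in
    Rmax-type algorithms; not the paper's Mormax, and all names are ours."""

    def __init__(self, n_states, n_actions, gamma, r_max, m):
        self.gamma, self.r_max, self.m = gamma, r_max, m
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.reward_sum = np.zeros((n_states, n_actions))
        self.trans_counts = np.zeros((n_states, n_actions, n_states))

    def record(self, s, a, r, s_next):
        # Only the first m samples of a pair are used for its model;
        # after that the pair is 'known' and its estimates are frozen.
        if self.counts[s, a] < self.m:
            self.counts[s, a] += 1
            self.reward_sum[s, a] += r
            self.trans_counts[s, a, s_next] += 1

    def is_known(self, s, a):
        return self.counts[s, a] >= self.m

    def model(self, s, a):
        # Known pair: empirical reward and transition estimates.
        # Unknown pair: an optimistic self-loop worth r_max / (1 - gamma)
        # under the discounted criterion, which attracts the agent and
        # thereby forces exploration of the pair.
        if self.is_known(s, a):
            r = self.reward_sum[s, a] / self.counts[s, a]
            p = self.trans_counts[s, a] / self.counts[s, a]
        else:
            r = self.r_max
            p = np.zeros(self.trans_counts.shape[2])
            p[s] = 1.0
        return r, p

Under this scheme, sample-complexity bounds essentially count how many steps the agent can spend at unknown pairs before every pair it keeps visiting becomes known, which is where the dependence on N enters.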