Abstract
We consider the infinite-horizon $\gamma$-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration—a family of algorithms that can interpolate between Value and Policy Iteration—with an error $\epsilon$ at each iteration is known to lead to stationary policies that are at least $\frac{2\gamma\epsilon}{(1-\gamma)^2}$-optimal. Variations of Value and Policy Iteration, that build $\ell$-periodic non-stationary policies, have recently been shown to display a better $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$-optimality guarantee. We describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers $m \ge 0$ and $\ell \ge 1$ that generalizes all the above-mentioned algorithms. While $m$ allows one to interpolate between Value-Iteration-style and Policy-Iteration-style updates, $\ell$ specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all $m$ and $\ell$; this tightness was to our knowledge only known in two specific cases, Value Iteration ($m = 0$, $\ell = 1$) and Policy Iteration ($m = \infty$, $\ell = 1$).
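To make the role of $m$ concrete, the following minimal Python sketch implements the classical stationary case ($\ell = 1$), i.e., standard Modified Policy Iteration on a tabular MDP, which the scheme above generalizes: with $m = 0$ the update reduces to Value Iteration, and letting $m \to \infty$ recovers Policy Iteration. The array shapes and function name are illustrative assumptions, not part of the paper; for $\ell > 1$ the non-stationary scheme additionally outputs an $\ell$-periodic policy rather than a single stationary one.

    import numpy as np

    def modified_policy_iteration(P, r, gamma, m, n_iters=100):
        # Classical stationary MPI (the l = 1 case; illustrative only).
        # P: (A, S, S) transition kernels, r: (A, S) expected rewards.
        # m = 0 gives Value Iteration; m -> infinity gives Policy Iteration.
        A, S, _ = P.shape
        v = np.zeros(S)
        for _ in range(n_iters):
            q = r + gamma * (P @ v)            # Q-values of v_k, shape (A, S)
            pi = np.argmax(q, axis=0)          # greedy step: pi_{k+1} = G(v_k)
            sts = np.arange(S)
            r_pi, P_pi = r[pi, sts], P[pi, sts]
            v = q[pi, sts]                     # one application of T_{pi_{k+1}}
            for _ in range(m):                 # m further evaluation sweeps,
                v = r_pi + gamma * (P_pi @ v)  # i.e. v_{k+1} = (T_pi)^{m+1} v_k
        return pi, v

On a small random MDP, increasing $m$ trades more evaluation sweeps per iteration against fewer iterations overall, which is exactly the VI-to-PI interpolation referred to above, here restricted to the stationary $\ell = 1$ setting.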