Abstract
In the context of large-space MDPs with linear value function approximation, we introduce a new approximate version of λ-Policy Iteration (Bertsekas & Ioffe, 1996), a method that generalizes Value Iteration and Policy Iteration with a parameter λ ∈ [0, 1]. Our approach, called Least-Squares λ Policy Iteration, generalizes LSPI (Lagoudakis & Parr, 2003), which makes efficient use of training samples compared to classical temporal-difference methods. The motivation of our work is to exploit the λ parameter within the least-squares context, without having to generate new samples at each iteration or to know a model of the MDP. We provide a performance bound that shows the soundness of the algorithm. We show empirically, on a simple chain problem and on the Tetris game, that the λ parameter acts as a bias-variance trade-off that may improve convergence and the performance of the resulting policy.
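For reference, a sketch of the λ-Policy Iteration update in standard Bellman-operator notation (the symbols $T_{\pi}$ for the Bellman operator of policy $\pi$ and $v_k$ for the $k$-th value estimate are our notation, introduced here for illustration and not quoted from the abstract):

$$
\pi_{k+1} \in \operatorname{greedy}(v_k), \qquad
v_{k+1} = (1-\lambda) \sum_{j=0}^{\infty} \lambda^{j} \, \bigl(T_{\pi_{k+1}}\bigr)^{j+1} v_k .
$$

Setting λ = 0 recovers Value Iteration ($v_{k+1} = T_{\pi_{k+1}} v_k$), while the limit λ → 1 recovers Policy Iteration ($v_{k+1} = v_{\pi_{k+1}}$), which is the sense in which λ-Policy Iteration generalizes both.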