
Improved and Generalized Upper Bounds on the Complexity of Policy Iteration


Abstract

Given a Markov Decision Process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O\left(\frac{nm}{1-\gamma}\log\frac{1}{1-\gamma}\right)$ iterations, improving by a factor $O(\log n)$ a result by [3], while Simplex-PI terminates after at most $O\left(\frac{n^2 m}{1-\gamma}\log\frac{1}{1-\gamma}\right)$ iterations, improving by a factor $O(\log n)$ a result by [11]. Under some structural assumptions on the MDP, we then consider bounds that are independent of the discount factor $\gamma$: given a measure of the maximal transient time $\tau_t$ and the maximal time $\tau_r$ to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most $\tilde O\left(n^3 m^2 \tau_t \tau_r\right)$ iterations. This generalizes a recent result for deterministic MDPs by [8], in which $\tau_t \le n$ and $\tau_r \le n$. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets of states that are respectively transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after a number of iterations that is polynomial in $n$, $m$, $\tau_t$ and $\tau_r$, independently of $\gamma$.
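To make the two update rules concrete, here is a minimal sketch, in Python, of policy iteration on a small synthetic MDP. The random instance, the tolerance, and all function names are illustrative assumptions and not taken from the paper; the sketch only demonstrates the distinction the abstract describes: Howard's PI switches the action in every state with a positive advantage, whereas Simplex-PI switches only the single state with the largest advantage.

```python
# Minimal sketch (not from the paper): Howard's PI vs. Simplex-PI update rules
# on a hypothetical random MDP. Instance, tolerance, and names are assumptions.
import numpy as np

def evaluate(policy, P, R, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for the value of a deterministic policy."""
    n = P.shape[1]
    P_pi = P[policy, np.arange(n)]           # transition rows under the policy, (n, n)
    r_pi = R[policy, np.arange(n)]           # reward vector under the policy, (n,)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma, rule="howard", tol=1e-12):
    """P: (m, n, n) transitions, R: (m, n) rewards; returns a policy and the iteration count."""
    m, n, _ = P.shape
    policy = np.zeros(n, dtype=int)
    for it in range(1, 10_000):
        v = evaluate(policy, P, R, gamma)
        q = R + gamma * P @ v                # action values Q(s, a), shape (m, n)
        adv = q - v                          # advantage of each action in each state
        adv[policy, np.arange(n)] = 0.0      # current actions have zero advantage
        if adv.max() <= tol:                 # no positive advantage: policy is optimal
            return policy, it
        if rule == "howard":                 # switch in every state with a positive advantage
            improvable = adv.max(axis=0) > tol
            policy[improvable] = adv[:, improvable].argmax(axis=0)
        else:                                # simplex: switch only the single best state
            a, s = np.unravel_index(adv.argmax(), adv.shape)
            policy[s] = a
    return policy, it

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 5, 3, 0.9
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)        # normalize rows into stochastic matrices
    R = rng.random((n_actions, n_states))
    for rule in ("howard", "simplex"):
        pol, iters = policy_iteration(P, R, gamma, rule)
        print(rule, "converged in", iters, "iterations to policy", pol)
```

On such small instances both rules converge in a handful of iterations; the bounds in the abstract describe how the worst-case number of iterations scales with $n$, $m$ and $\gamma$.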

