资源论文Batch-Switching Policy Iteration

Batch-Switching Policy Iteration

2019-11-22 | |  67 |   47 |   0
Abstract Policy Iteration (PI) is a widely-used family of algorithms for computing an optimal policy for a given Markov Decision Problem (MDP). Starting with an arbitrary initial policy, PI repeatedly updates to a dominating policy until an optimal policy is found. The update step involves switching the actions corresponding to a set of “improvable” states, which are easily identified. Whereas progress is guaranteed even if just one improvable state is switched at every step, the canonical variant of PI, attributed to Howard [1960], switches every improvable state in order to obtain the next iterate. For MDPs with n states and 2 actions per state, the tightest known bound on the complexity of Howard’s PI is O(2n /n) iterations. To date, the tightest bound known across all variants of PI is O(1.7172n ) expected iterations for a randomised variant introduced by Mansour and Singh [1999]. We introduce Batch-Switching Policy Iteration (BSPI), a family of deterministic PI algorithms that switches states in “batches”, taking the batch size b as a parameter. By varying b, BSPI interpolates between Howard’s PI and another previously-studied variant called Simple PI [Melekopoglou and Condon, 1994]. Our main contribution is a bound of O(1.6479n ) on the number of iterations taken by an instance of BSPI. We believe this is the tightest bound shown yet for any variant of PI. We also present experimental results that suggest Howard’s PI might itself enjoy an even tighter bound.

上一篇:Anticipatory Troubleshooting

下一篇:In Search of Tractability for Partial Satisfaction Planning

用户评价
全部评价

热门资源

  • Learning to Predi...

    Much of model-based reinforcement learning invo...

  • Stratified Strate...

    In this paper we introduce Stratified Strategy ...

  • The Variational S...

    Unlike traditional images which do not offer in...

  • A Mathematical Mo...

    Direct democracy, where each voter casts one vo...

  • Rating-Boosted La...

    The performance of a recommendation system reli...