
Large Scale Markov Decision Processes with Changing Rewards

2020-02-20

Abstract

We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of $O(\sqrt{\tau(\ln|S| + \ln|A|)T}\,\ln T)$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d \ll |S|$, we propose a modified algorithm with a computational complexity polynomial in $d$ and independent of $|S|$. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound in the large-scale MDP setting with adversarially changing rewards.
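The approach the abstract describes treats the online MDP as linear optimization over state-action occupancy measures: play the policy induced by the current occupancy measure, observe the adversarially chosen reward, take an entropic (mirror-descent) step, and return to the occupancy-measure polytope. The following Python sketch illustrates that loop under stated assumptions; it is not the paper's algorithm. The transition kernel `P`, step size `eta`, and the `project_to_valid` heuristic (a crude stand-in for exact projection onto the occupancy polytope) are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 4, 3, 200
eta = 0.5 / np.sqrt(T)  # step size ~ 1/sqrt(T), matching an O(sqrt(T)) regret rate

# Random transition kernel P[s, a, s'] for the illustration (assumed known).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

# Occupancy measure mu[s, a]: long-run fraction of time spent in (s, a).
mu = np.full((n_states, n_actions), 1.0 / (n_states * n_actions))

def project_to_valid(mu, n_iters=50):
    """Crude stand-in for projection onto the occupancy-measure polytope
    {mu >= 0, sum(mu) = 1, flow conservation under P}: alternately
    renormalize and nudge each state's outflow toward its inflow."""
    for _ in range(n_iters):
        mu = np.maximum(mu, 1e-12)
        mu /= mu.sum()
        state_in = np.einsum('sa,sat->t', mu, P)   # inflow to each state
        state_out = mu.sum(axis=1)                 # outflow from each state
        mu = mu * (state_in / np.maximum(state_out, 1e-12))[:, None]
    return mu / mu.sum()

mu = project_to_valid(mu)
total_reward = 0.0
for t in range(T):
    r = rng.random((n_states, n_actions))  # adversarial reward, revealed after play
    total_reward += (mu * r).sum()         # per-period reward under occupancy mu
    mu = mu * np.exp(eta * r)              # entropic / multiplicative-weights step
    mu = project_to_valid(mu)              # return to the feasible polytope

print(f"average per-period reward: {total_reward / T:.3f}")
```

In the large-scale variant the abstract refers to, the occupancy measure would instead be parameterized by a linear architecture of dimension $d$ (e.g., mu ≈ Φθ for a feature matrix Φ), so the update and projection steps scale with $d$ rather than with $|S|$.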
