Abstract
We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of $O\big(\sqrt{\tau(\ln|S|+\ln|A|)\,T}\,\ln T\big)$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d \ll |S|$, we propose a modified algorithm with a computational complexity polynomial in $d$ and independent of $|S|$. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first regret bound in the large-scale MDP setting with adversarially changing rewards.
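The abstract does not spell out the algorithm, but a standard way to realize online learning over state-action occupancy measures, consistent with the quantities named above, is entropic mirror descent (multiplicative weights) on the occupancy measure. The sketch below is an illustrative assumption, not the paper's exact method: the function names, the step size, and the omission of the projection onto the occupancy-measure polytope are all simplifications for exposition.

```python
import numpy as np

def mirror_descent_step(mu, reward, eta):
    # Entropic mirror descent (multiplicative weights) on the occupancy
    # measure mu(s, a), using the reward vector observed this period.
    # NOTE: a full algorithm would also project back onto the polytope of
    # occupancy measures consistent with the MDP dynamics; that projection
    # is omitted here for brevity.
    mu_new = mu * np.exp(eta * reward)
    return mu_new / mu_new.sum()

def policy_from_occupancy(mu, eps=1e-12):
    # Recover a stochastic policy pi(a | s) by conditioning mu(s, a)
    # on its state marginal.
    return mu / np.maximum(mu.sum(axis=1, keepdims=True), eps)

# Illustrative run with random stand-in rewards in place of an adversary.
rng = np.random.default_rng(0)
n_states, n_actions, T = 5, 3, 1000
mu = np.full((n_states, n_actions), 1.0 / (n_states * n_actions))
# Step size chosen to mirror the ln|S| + ln|A| dependence in the bound.
eta = np.sqrt((np.log(n_states) + np.log(n_actions)) / T)
for t in range(T):
    pi = policy_from_occupancy(mu)                     # policy played in period t
    reward = rng.uniform(size=(n_states, n_actions))   # revealed after acting
    mu = mirror_descent_step(mu, reward, eta)
```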
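The "linear architecture of dimension $d$" presumably refers to restricting the occupancy measure to a low-dimensional subspace. One standard way to write this (the feature matrix $\Phi$ and this exact parameterization are assumptions, not taken from the abstract) is:

```latex
% Hedged sketch of a linear architecture for occupancy measures: the
% |S||A|-dimensional measure is constrained to the column span of a
% feature matrix \Phi, so only d parameters are learned.
\mu_\theta = \Phi\,\theta,
\qquad \Phi \in \mathbb{R}^{|S||A| \times d},
\quad \theta \in \mathbb{R}^{d},
\quad d \ll |S|
```

Under such a parameterization the learner updates only the $d$ coordinates of $\theta$, which is how a per-period cost polynomial in $d$, rather than in $|S|$, can arise.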