
Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search


Abstract

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss suffered by the learner and the total loss of the best stationary policy. We assume that the learner is given access to a finite action space A and that the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is $O\bigl(\sqrt{L|X||A|T\log(|X||A|/L)}\bigr)$ in the bandit setting and $O\bigl(L\sqrt{T\log(|X||A|/L)}\bigr)$ in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve on previously known results, hold under much milder assumptions, and cannot be significantly improved under general assumptions.
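To make the quantities concrete, the following is a minimal sketch, in illustrative notation that is not taken from the paper, of the regret being bounded and of a relative-entropy (mirror-descent) update over occupancy measures of the kind the algorithm is built on. The symbols $\Delta(P)$, $q_t$, $\eta$ and $\hat{\ell}_t$ are assumed names for the occupancy-measure polytope induced by the known transitions, the current iterate, a learning rate, and the observed (or importance-weighted) loss estimate.

```latex
% Illustrative notation (assumed, not from the paper); requires amsmath.
% Regret after T episodes: the learner's cumulative loss minus that of
% the best stationary policy chosen in hindsight.
\[
  \widehat{R}_T
  \;=\;
  \sum_{t=1}^{T}\sum_{l=0}^{L-1} \ell_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr)
  \;-\;
  \min_{\pi}\;
  \mathbb{E}\!\left[\sum_{t=1}^{T}\sum_{l=0}^{L-1} \ell_t(x_l, a_l)\;\middle|\;\pi\right].
\]
% Schematic relative-entropy update over occupancy measures q on X x A,
% where Delta(P) is the set of occupancy measures consistent with the
% known transition kernel P and D is the unnormalized relative entropy:
\[
  q_{t+1}
  \;=\;
  \operatorname*{arg\,min}_{q \in \Delta(P)}
  \Bigl\{ \eta\,\langle q, \hat{\ell}_t\rangle + D\bigl(q \,\|\, q_t\bigr) \Bigr\},
  \qquad
  D(q\,\|\,q')
  = \sum_{x,a}\Bigl(q(x,a)\log\tfrac{q(x,a)}{q'(x,a)} - q(x,a) + q'(x,a)\Bigr).
\]
```

In this occupancy-measure view, the policy played in episode $t+1$ is recovered by conditioning, $\pi_{t+1}(a\mid x) = q_{t+1}(x,a)\big/\sum_{b} q_{t+1}(x,b)$.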

