Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

2020-02-21

Abstract

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability β > 0 under any policy, we give a regret bound of $\tilde{O}(L|X|\sqrt{|A|T}/\beta)$, where T is the number of episodes, X is the state space, A is the action space, and L is the length of each episode. When this assumption is removed, we give a regret bound of $\tilde{O}(L^{3/2}|X||A|^{1/4}T^{3/4})$ that holds for an arbitrary transition function. To our knowledge, these are the first algorithms that handle both bandit feedback and an unknown transition function in this setting.
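As a point of reference for the bounds above, the following is a minimal sketch of the regret notion against the best stationary policy in hindsight, written in the abstract's notation (T episodes of length L, adversarially chosen losses ℓ_t); it is an assumed, standard formulation, and the precise definition used in the paper may differ in minor details.

```latex
% Sketch (assumed, not quoted from the paper) of regret against the best
% stationary policy in hindsight, in the abstract's notation:
%   \pi_t  -- the learner's policy in episode t
%   \ell_t -- the loss function of episode t (revealed only via bandit feedback)
%   the expectation is over the length-L trajectory (x_0,a_0,...,x_{L-1},a_{L-1})
%   induced by a policy and the fixed, unknown transition function.
\[
  \mathrm{Regret}_T
  \;=\;
  \sum_{t=1}^{T} \mathbb{E}^{\pi_t}\!\Bigl[\,\sum_{k=0}^{L-1} \ell_t(x_k, a_k)\Bigr]
  \;-\;
  \min_{\pi}\,
  \sum_{t=1}^{T} \mathbb{E}^{\pi}\!\Bigl[\,\sum_{k=0}^{L-1} \ell_t(x_k, a_k)\Bigr]
\]
```

Read against this definition, the first bound grows as $\sqrt{T}$ (up to the $1/\beta$ factor), while the assumption-free bound grows as $T^{3/4}$; both are sublinear in T, so the average per-episode regret vanishes as $T \to \infty$, which is what "no-regret" means here.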
