Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

2020-02-21

Abstract

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability β > 0 under any policy, we give a regret bound of $\tilde{O}(L|X|\sqrt{|A|T}/\beta)$, where T is the number of episodes, X is the state space, A is the action space, and L is the length of each episode. When this assumption is removed, we give a regret bound of $\tilde{O}(L^{3/2}|X||A|^{1/4}T^{3/4})$ that holds for an arbitrary transition function. To our knowledge, these are the first algorithms that handle both bandit feedback and an unknown transition function in this setting.
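As a point of reference for the bounds above, the following is a minimal sketch of the regret notion against the best stationary policy in hindsight, written in the abstract's notation (T episodes of length L, adversarially chosen losses ℓ_t); it is an assumed, standard formulation, and the precise definition used in the paper may differ in minor details.

```latex
% Sketch (assumed, not quoted from the paper) of regret against the best
% stationary policy in hindsight, in the abstract's notation:
%   \pi_t  -- the learner's policy in episode t
%   \ell_t -- the loss function of episode t (revealed only via bandit feedback)
%   the expectation is over the length-L trajectory (x_0,a_0,...,x_{L-1},a_{L-1})
%   induced by a policy and the fixed, unknown transition function.
\[
  \mathrm{Regret}_T
  \;=\;
  \sum_{t=1}^{T} \mathbb{E}^{\pi_t}\!\Bigl[\,\sum_{k=0}^{L-1} \ell_t(x_k, a_k)\Bigr]
  \;-\;
  \min_{\pi}\,
  \sum_{t=1}^{T} \mathbb{E}^{\pi}\!\Bigl[\,\sum_{k=0}^{L-1} \ell_t(x_k, a_k)\Bigr]
\]
```

Read against this definition, the first bound grows as $\sqrt{T}$ (up to the $1/\beta$ factor), while the assumption-free bound grows as $T^{3/4}$; both are sublinear in T, so the average per-episode regret vanishes as $T \to \infty$, which is what "no-regret" means here.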
