A new Q(λ) with interim forward view and Monte Carlo equivalence

2020-03-03

Abstract

Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ = 1). In off-policy algorithms, including Q(λ), GQ(λ), and off-policy LSTD(λ), the λ parameter is intended to play the same role, but does not; on every exploratory action these algorithms bootstrap regardless of the value of λ, and as a result they fail to approximate Monte Carlo learning when λ = 1. It may seem that this is inevitable for any online off-policy algorithm; if updates are made on each step on which the target policy is followed, then how could just the right updates be ‘un-made’ upon deviation from the target policy? In this paper, we introduce a new version of Q(λ) that does exactly that, without significantly increased algorithmic complexity. En route to our new Q(λ), we introduce a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ) but extended to apply at every time step rather than only at the end of episodes. We apply this technique to derive first a new off-policy version of TD(λ), called PTD(λ), and then our new Q(λ), called PQ(λ).
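
To make the role of λ concrete, the following is a minimal sketch of the standard on-policy λ-return for a single episode, computed with the usual backward recursion. It is not the paper's PTD(λ) or PQ(λ); the function name `lambda_returns`, its argument layout, and the toy numbers are assumptions chosen only for illustration. It shows that λ = 0 yields one-step bootstrapped targets and λ = 1 yields Monte Carlo returns.

```python
def lambda_returns(rewards, values, gamma, lam):
    """Forward-view lambda-returns for one finished episode (illustrative sketch).

    rewards[t] is the reward received after step t (R_{t+1}).
    values[t] is the current estimate V(S_t); the terminal state has value 0.
    Uses the recursion G_t = R_{t+1} + gamma * ((1 - lam) * V(S_{t+1}) + lam * G_{t+1}).
    """
    T = len(rewards)
    returns = [0.0] * T
    next_return = 0.0  # lambda-return from the terminal state
    next_value = 0.0   # value estimate of the terminal state
    for t in reversed(range(T)):
        returns[t] = rewards[t] + gamma * ((1 - lam) * next_value + lam * next_return)
        next_return = returns[t]
        next_value = values[t]
    return returns


# Hypothetical toy episode with three steps.
rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5]

td_targets = lambda_returns(rewards, values, gamma=0.5, lam=0.0)  # pure bootstrapping
mc_returns = lambda_returns(rewards, values, gamma=0.5, lam=1.0)  # pure Monte Carlo
print(td_targets)  # [1.5, 2.75, 3.0]
print(mc_returns)  # [2.75, 3.5, 3.0]
```

With λ = 1 the value estimates drop out entirely and each target is the discounted sum of the remaining rewards. This is the behaviour that, according to the abstract, earlier off-policy algorithms lose whenever an exploratory action is taken.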
