A new Q(λ) with interim forward view and Monte Carlo equivalence

2020-03-03

Abstract

Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ = 1). In off-policy algorithms, including Q(λ), GQ(λ), and off-policy LSTD(λ), the λ parameter is intended to play the same role, but does not: on every exploratory action these algorithms bootstrap regardless of the value of λ, and as a result they fail to approximate Monte Carlo learning when λ = 1. It may seem that this is inevitable for any online off-policy algorithm. If updates are made on each step on which the target policy is followed, then how could just the right updates be ‘un-made’ upon deviation from the target policy? In this paper, we introduce a new version of Q(λ) that does exactly that, without significantly increased algorithmic complexity. En route to our new Q(λ), we introduce a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ), but extended to apply at every time step rather than only at the end of episodes. We apply this technique to derive first a new off-policy version of TD(λ), called PTD(λ), and then our new Q(λ), called PQ(λ).
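The following sketch illustrates how λ interpolates between the two extremes. It computes the episodic λ-return with the standard backward recursion G^λ_t = R_{t+1} + γ[(1 − λ)V(S_{t+1}) + λG^λ_{t+1}].

The sketch rests on these assumptions:

* The setting is tabular and episodic.
* The terminal state has value 0.
* The helper name `lambda_returns` and its argument layout are hypothetical, chosen only for this example.

With λ = 0 each target reduces to the one-step bootstrapped TD target. With λ = 1 it reduces to the Monte Carlo return.

```python
from typing import List, Sequence


def lambda_returns(rewards: Sequence[float], next_values: Sequence[float],
                   gamma: float, lam: float) -> List[float]:
    """Episodic lambda-returns G^lam_t for t = 0..T-1 (hypothetical helper).

    rewards[t]     -- R_{t+1}, the reward received after step t
    next_values[t] -- V(S_{t+1}); assumed to be 0.0 when S_{t+1} is terminal
    Recursion: G_t = R_{t+1} + gamma * ((1 - lam) * V(S_{t+1}) + lam * G_{t+1}),
    with the lambda-return beyond the terminal step taken to be 0.
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have the same length")
    returns = [0.0] * len(rewards)
    g = 0.0  # lambda-return beyond the end of the episode
    for t in reversed(range(len(rewards))):
        # Mix the one-step bootstrap V(S_{t+1}) with the longer return G_{t+1}.
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    next_values = [0.5, 0.7, 0.0]  # the last next-state is terminal
    print(lambda_returns(rewards, next_values, gamma=0.9, lam=0.0))  # one-step TD targets
    print(lambda_returns(rewards, next_values, gamma=0.9, lam=1.0))  # Monte Carlo returns
```

A single backward pass keeps the cost linear in the episode length.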
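The following sketch makes the failure mode concrete. It computes a forward-view target for the classic Watkins-style Q(λ). In that algorithm the trace is cut at the first exploratory (non-greedy) action, and the target bootstraps from max_a Q at that point.

This is the textbook forward view, not the paper's PQ(λ). It assumes:

* a tabular, episodic setting;
* a greedy target policy.

The names `watkins_lambda_targets` and `monte_carlo_returns` are hypothetical. Even with λ = 1, the targets match the Monte Carlo returns only when every action after step t is greedy.

```python
from typing import List, Sequence


def watkins_lambda_targets(rewards: Sequence[float], max_q_next: Sequence[float],
                           greedy_next: Sequence[bool], gamma: float,
                           lam: float) -> List[float]:
    """Forward-view targets for Watkins-style Q(lambda) (hypothetical helper).

    rewards[t]     -- R_{t+1}
    max_q_next[t]  -- max_a Q(S_{t+1}, a); assumed 0.0 if S_{t+1} is terminal
    greedy_next[t] -- True if A_{t+1} is greedy w.r.t. Q (unused at the last step)
    """
    T = len(rewards)
    if len(max_q_next) != T or len(greedy_next) != T:
        raise ValueError("all inputs must have the same length")
    targets = [0.0] * T
    g = 0.0
    for t in reversed(range(T)):
        if t == T - 1:
            # Episode ends here: bootstrap from max_q_next (0.0 if terminal).
            g = rewards[t] + gamma * max_q_next[t]
        elif greedy_next[t]:
            # Greedy continuation: the usual lambda-weighted mix.
            g = rewards[t] + gamma * ((1.0 - lam) * max_q_next[t] + lam * g)
        else:
            # Exploratory A_{t+1}: trace is cut, so bootstrap regardless of lam.
            g = rewards[t] + gamma * max_q_next[t]
        targets[t] = g
    return targets


def monte_carlo_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Plain discounted returns, for comparison."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0, 1.0]
    max_q = [0.5, 0.5, 0.5, 0.0]
    greedy = [True, False, True, True]  # A_2 is exploratory
    print(watkins_lambda_targets(rewards, max_q, greedy, gamma=1.0, lam=1.0))
    print(monte_carlo_returns(rewards, gamma=1.0))
```

In the usage example, the exploratory action at step 2 forces the targets for steps 0 and 1 to bootstrap from `max_q`, so they differ from the Monte Carlo returns even though λ = 1.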
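The forward-view/backward-view analysis that the derivation builds on can be checked numerically in its classic, end-of-episode form. Consider tabular TD(λ) with accumulating traces, with values held fixed during the episode (offline updating). For every state s, the summed trace-based increments δ_t e_t(s) equal the summed forward-view increments G^λ_t − V(S_t).

The sketch below assumes this offline, episodic setting with terminal value 0, and the function names are hypothetical. It does not implement the paper's interim (every-step) forward view or PTD(λ).

```python
from typing import Dict, Sequence


def forward_view_increments(states: Sequence[str], rewards: Sequence[float],
                            V: Dict[str, float], gamma: float,
                            lam: float) -> Dict[str, float]:
    """Per-state sum of (G^lam_t - V(S_t)); the state after the last reward is terminal."""
    T = len(rewards)
    next_v = [V[states[t + 1]] if t + 1 < T else 0.0 for t in range(T)]
    inc = {s: 0.0 for s in V}
    g = 0.0  # lambda-return beyond the terminal step
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * next_v[t] + lam * g)
        inc[states[t]] += g - V[states[t]]
    return inc


def backward_view_increments(states: Sequence[str], rewards: Sequence[float],
                             V: Dict[str, float], gamma: float,
                             lam: float) -> Dict[str, float]:
    """Per-state sum of delta_t * e_t(s) with accumulating traces; V is not updated mid-episode."""
    T = len(rewards)
    trace = {s: 0.0 for s in V}
    inc = {s: 0.0 for s in V}
    for t in range(T):
        v_next = V[states[t + 1]] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * v_next - V[states[t]]  # one-step TD error
        trace[states[t]] += 1.0  # accumulating trace for the visited state
        for s in V:
            inc[s] += delta * trace[s]
            trace[s] *= gamma * lam  # decay before the next step
    return inc


if __name__ == "__main__":
    states = ["A", "B", "A", "C"]
    rewards = [1.0, -0.5, 2.0, 0.0]
    V = {"A": 0.3, "B": -0.2, "C": 0.8}
    fwd = forward_view_increments(states, rewards, V, gamma=0.9, lam=0.7)
    bwd = backward_view_increments(states, rewards, V, gamma=0.9, lam=0.7)
    for s in V:
        print(s, fwd[s], bwd[s])  # equal up to floating-point rounding
```

When V is instead changed on every step (online updating), this classic equivalence holds only approximately.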
