Re-evaluating Complex Backups in Temporal Difference Learning

2020-01-08


Abstract

We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ, the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ(C), with a capacity parameter C. TDγ(C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ(C) performs as well as or better than TDγ for intermediate settings of C.
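The idea of a complex backup as a maximum likelihood estimate can be illustrated with a small sketch. If the n-step returns are treated as independent Gaussian estimates of the same quantity, the maximum likelihood combination weights each one by the inverse of its variance. The code below is a minimal sketch under stated assumptions, not the paper's implementation: all function names are hypothetical, the weights are normalised over the L returns available in one episode, and the variance model in `gamma_variances` (each additional reward adds variance discounted by the square of the discount factor) is an assumed form, since the abstract does not state the exact model. Under a variance model proportional to λ^(-n), the weights decay geometrically in λ, which matches the shape of the λ-return weights; the standard episodic λ-return assigns the leftover weight to the final return instead of renormalising.

```python
def n_step_returns(rewards, values, gamma):
    """Return the n-step returns from the start state for n = 1..L.

    rewards[i] is the reward received on step i + 1; values[i] is the
    estimated value of the state reached after that step (0.0 if terminal).
    """
    returns, discounted_sum = [], 0.0
    for n, (r, v) in enumerate(zip(rewards, values), start=1):
        discounted_sum += gamma ** (n - 1) * r
        returns.append(discounted_sum + gamma ** n * v)
    return returns


def inverse_variance_weights(variances):
    """Maximum likelihood weights for independent Gaussian estimates."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [w / total for w in inverse]


def lambda_variances(length, lam):
    """Variance model proportional to lam ** (-n); yields geometric weights."""
    return [lam ** (-n) for n in range(1, length + 1)]


def gamma_variances(length, gamma):
    """ASSUMED model: step i adds variance gamma ** (2 * (i - 1))."""
    variances, accumulated = [], 0.0
    for i in range(1, length + 1):
        accumulated += gamma ** (2 * (i - 1))
        variances.append(accumulated)
    return variances


def complex_return(returns, weights):
    """Weighted average of n-step returns (the complex-backup target)."""
    return sum(w * g for w, g in zip(weights, returns))


if __name__ == "__main__":
    gamma = 0.9
    rewards = [1.0, 0.0, 2.0]   # hypothetical episode
    values = [0.5, 0.4, 0.0]    # hypothetical value estimates, last is terminal
    returns = n_step_returns(rewards, values, gamma)
    for name, variances in [
        ("lambda-style", lambda_variances(len(returns), 0.8)),
        ("gamma-style", gamma_variances(len(returns), gamma)),
    ]:
        weights = inverse_variance_weights(variances)
        print(name, weights, complex_return(returns, weights))
```

Note that the gamma-style weights depend only on the discount factor, which is already part of the problem definition, so no extra parameter has to be tuned. The lambda-style weights require choosing a value for λ.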