Abstract
Recently, a new multi-step temporal-difference learning algorithm, Q(σ), was proposed; it unifies n-step Tree-Backup (when σ = 0) and n-step Sarsa (when σ = 1) by introducing a sampling parameter σ. However, like other multi-step temporal-difference learning algorithms, Q(σ) requires considerable memory and computation time. The eligibility trace is an important mechanism for transforming offline updates into efficient online ones that consume less memory and computation time. In this paper, we combine the original Q(σ) with eligibility traces and propose a new algorithm, called Q(σ, λ), where λ is the trace-decay parameter. This new algorithm unifies Sarsa(λ) (when σ = 1) and Q^π(λ) (when σ = 0). Furthermore, we give an upper error bound for the Q(σ, λ) policy evaluation algorithm, and we prove that the Q(σ, λ) control algorithm converges to the optimal value function exponentially. We also empirically compare it with conventional temporal-difference learning methods. Results show that, with an intermediate value of σ, Q(σ, λ) creates a mixture of the existing algorithms that learns the optimal value function significantly faster than the extreme ends (σ = 0 or σ = 1).
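The role of the sampling parameter σ can be illustrated with a minimal tabular sketch of the one-step Q(σ) TD error (the function name, array layout, and variable names here are illustrative assumptions, not the paper's notation): σ = 1 recovers the sampled Sarsa backup, while σ = 0 recovers the expected backup used by Tree-Backup.

```python
import numpy as np

def q_sigma_td_error(Q, pi, s, a, r, s_next, a_next, sigma, gamma):
    """One-step Q(sigma) TD error (illustrative sketch).

    Q  : array of shape (n_states, n_actions), action-value estimates
    pi : array of shape (n_states, n_actions), target-policy probabilities
    sigma = 1.0 -> Sarsa backup (pure sampling)
    sigma = 0.0 -> expected backup (Tree-Backup / Expected Sarsa)
    """
    sample_backup = Q[s_next, a_next]                # sampled next action value
    expected_backup = np.dot(pi[s_next], Q[s_next])  # expectation under pi
    target = r + gamma * (sigma * sample_backup + (1.0 - sigma) * expected_backup)
    return target - Q[s, a]

# Tiny two-state example: interpolate between the two backup styles.
Q = np.array([[1.0, 2.0], [3.0, 4.0]])
pi = np.array([[0.5, 0.5], [0.25, 0.75]])
sarsa_err = q_sigma_td_error(Q, pi, 0, 0, 1.0, 1, 0, sigma=1.0, gamma=0.9)
tree_err = q_sigma_td_error(Q, pi, 0, 0, 1.0, 1, 0, sigma=0.0, gamma=0.9)
```

An intermediate σ (e.g. 0.5) simply averages the two TD errors, which is the mixture the abstract refers to; the full Q(σ, λ) algorithm additionally decays an eligibility trace by λ per step.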