Abstract
Recently, a new multi-step temporal-difference learning algorithm, Q(σ), was proposed; it unifies n-step Tree-Backup (when σ = 0) and n-step Sarsa (when σ = 1) by introducing a sampling parameter σ. However, like other multi-step temporal-difference learning algorithms, Q(σ) requires considerable memory and computation time. The eligibility trace is an important mechanism for transforming offline updates into efficient online ones that consume less memory and computation time. In this paper, we combine the original Q(σ) with eligibility traces and propose a new algorithm, called Q(σ, λ), where λ is the trace-decay parameter. This new algorithm unifies Sarsa(λ) (when σ = 1) and Q^π(λ) (when σ = 0). Furthermore, we give an upper error bound for the Q(σ, λ) policy evaluation algorithm, and we prove that the Q(σ, λ) control algorithm converges to the optimal value function exponentially. We also empirically compare it with conventional temporal-difference learning methods. Results show that, with an intermediate value of σ, Q(σ, λ) creates a mixture of the existing algorithms that learns the optimal value function significantly faster than the extreme ends (σ = 0 or σ = 1).
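The role of the sampling parameter σ can be illustrated with a minimal tabular sketch of the one-step Q(σ) TD error (the function name, array layout, and variable names here are illustrative assumptions, not the paper's notation): σ = 1 recovers the sampled Sarsa backup, while σ = 0 recovers the expected backup used by Tree-Backup.

```python
import numpy as np

def q_sigma_td_error(Q, pi, s, a, r, s_next, a_next, sigma, gamma):
    """One-step Q(sigma) TD error (illustrative sketch).

    Q  : array of shape (n_states, n_actions), action-value estimates
    pi : array of shape (n_states, n_actions), target-policy probabilities
    sigma = 1.0 -> Sarsa backup (pure sampling)
    sigma = 0.0 -> expected backup (Tree-Backup / Expected Sarsa)
    """
    sample_backup = Q[s_next, a_next]                # sampled next action value
    expected_backup = np.dot(pi[s_next], Q[s_next])  # expectation under pi
    target = r + gamma * (sigma * sample_backup + (1.0 - sigma) * expected_backup)
    return target - Q[s, a]

# Tiny two-state example: interpolate between the two backup styles.
Q = np.array([[1.0, 2.0], [3.0, 4.0]])
pi = np.array([[0.5, 0.5], [0.25, 0.75]])
sarsa_err = q_sigma_td_error(Q, pi, 0, 0, 1.0, 1, 0, sigma=1.0, gamma=0.9)
tree_err = q_sigma_td_error(Q, pi, 0, 0, 1.0, 1, 0, sigma=0.0, gamma=0.9)
```

An intermediate σ (e.g. 0.5) simply averages the two TD errors, which is the mixture the abstract refers to; the full Q(σ, λ) algorithm additionally decays an eligibility trace by λ per step.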