Unobserved Is Not Equal to Non-existent: Using Gaussian Processes to Infer Immediate Rewards Across Contexts

2019-10-08

Abstract: Learning optimal policies in real-world domains with delayed rewards is a major challenge in Reinforcement Learning. We address the credit assignment problem by proposing a Gaussian Process (GP)-based immediate reward approximation algorithm and evaluate its effectiveness in 4 contexts where rewards can be delayed over long trajectories. In one GridWorld game and 8 Atari games, where immediate rewards are available, our results showed that on 7 out of 9 games, the proposed GP-inferred reward policy performed at least as well as the immediate reward policy and significantly outperformed the corresponding delayed reward policy. In e-learning and healthcare applications, we combined GP-inferred immediate rewards with offline Deep Q-Network (DQN) policy induction and showed that the GP-inferred reward policies outperformed the policies induced using delayed rewards in both real-world contexts.
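As a concrete illustration of the idea, the sketch below infers per-step rewards from per-episode delayed returns via Gaussian-process conditioning. This is a minimal sketch of the general principle, not the authors' exact algorithm: it assumes an RBF kernel over hypothetical state features, synthetic toy data, and a model in which each observed delayed return is the noisy sum of the unobserved immediate rewards in its episode.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    # Squared-exponential kernel between rows of X1 and X2.
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return variance * np.exp(-0.5 * sq_dists / length_scale ** 2)

def infer_immediate_rewards(step_features, episode_ids, delayed_returns, noise=1e-2):
    """Posterior mean of per-step rewards given only per-episode delayed returns.

    Assumed model (a sketch, not the paper's exact formulation): immediate
    rewards r ~ GP(0, K) over step features, and each delayed return is the
    noisy sum of its episode's immediate rewards, R = A r + eps.  Gaussian
    conditioning then gives E[r | R] = K A^T (A K A^T + noise * I)^{-1} R.
    """
    n_steps = step_features.shape[0]
    n_episodes = delayed_returns.shape[0]
    # Episode-membership matrix: A[j, i] = 1 iff step i belongs to episode j.
    A = np.zeros((n_episodes, n_steps))
    A[episode_ids, np.arange(n_steps)] = 1.0
    K = rbf_kernel(step_features, step_features)
    S = A @ K @ A.T + noise * np.eye(n_episodes)   # covariance of the returns
    return K @ A.T @ np.linalg.solve(S, delayed_returns)

# Toy usage: two short episodes whose rewards are observed only as episode totals.
rng = np.random.default_rng(0)
step_features = rng.normal(size=(10, 3))        # 10 steps, 3 state features each
episode_ids = np.array([0] * 5 + [1] * 5)       # first 5 steps belong to episode 0
delayed_returns = np.array([3.0, -1.0])         # one delayed reward per episode
r_hat = infer_immediate_rewards(step_features, episode_ids, delayed_returns)
print(r_hat)                                    # inferred per-step immediate rewards
print(r_hat[:5].sum(), r_hat[5:].sum())         # approximately recover the returns
```

Under this kind of setup, the inferred per-step rewards can stand in for the missing immediate rewards when inducing a policy offline (e.g. with a DQN), which is the role the GP-inferred rewards play in the abstract above.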
