Average Reward Optimization Objective In Partially Observable Domains


Abstract

We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs) (Littman et al., 2001). The key to average-reward computation is a well-defined stationary behavior of the system, so that the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of the policy parameters. Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average-reward-based policy search methods can be effective when the dimension of the system is small, even when representing the system as a POMDP requires many hidden states. We provide illustrative examples of this type.
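To make the rational-function claim concrete, here is a minimal sketch that is not taken from the paper: a hypothetical two-state, two-action, two-observation POMDP (which, like any finite POMDP, induces a linear PSR) is controlled by a one-parameter memoryless policy, and sympy recovers the long-run average reward in closed form. All matrices and names (T0, T1, O, R, p) are illustrative assumptions.

```python
import sympy as sp

# Hypothetical two-state, two-action, two-observation POMDP.
# None of these numbers come from the paper.
p = sp.symbols('p')           # single policy parameter in [0, 1]
m0, m1 = sp.symbols('m0 m1')  # stationary probabilities (unknowns)

O  = sp.Matrix([[sp.Rational(4, 5), sp.Rational(1, 5)],
                [sp.Rational(1, 5), sp.Rational(4, 5)]])     # O[s, o]
T0 = sp.Matrix([[sp.Rational(9, 10), sp.Rational(1, 10)],
                [sp.Rational(1, 10), sp.Rational(9, 10)]])   # action 0
T1 = sp.Matrix([[sp.Rational(1, 10), sp.Rational(9, 10)],
                [sp.Rational(1, 2),  sp.Rational(1, 2)]])    # action 1
R  = sp.Matrix([[1, 0],
                [0, 2]])                                     # R[s, a]

# Memoryless policy: P(a=1 | o=1) = p, P(a=1 | o=0) = 1 - p.
# Marginalizing out the observation gives P(a=1 | s):
a1 = [O[s, 0] * (1 - p) + O[s, 1] * p for s in range(2)]

# Markov chain induced over the hidden state by the policy.
P = sp.Matrix(2, 2, lambda s, t: (1 - a1[s]) * T0[s, t] + a1[s] * T1[s, t])

# Stationary distribution: mu P = mu, with mu summing to one.
mu = sp.Matrix([[m0, m1]])
sol = sp.solve([(mu * P - mu)[0], m0 + m1 - 1], [m0, m1])

# Average reward under the stationary distribution.
avg = sum(sol[m] * ((1 - a) * R[s, 0] + a * R[s, 1])
          for s, (m, a) in enumerate(zip([m0, m1], a1)))
print(sp.simplify(avg))  # a ratio of polynomials in p
```

For the numbers above the printed expression should be equivalent to (1 + 3p)(83 - 36p) / (5(46 - 12p)): a low-degree rational function of the policy parameter, whose every value in (0, 1) can be probed by smooth policy search. The paper's point is that the degree of this function is controlled by the dimension of the linear PSR rather than by the number of hidden states.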
