
Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives

2020-02-23

Abstract

We consider an agent interacting with an online Markov Decision Process who receives a vector of outcomes in every round. The agent optimizes an aggregate reward function over the multi-dimensional outcomes. Due to state transitions, it is challenging to balance the contributions from each dimension to achieve near-optimality. In contrast to the single-objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm that builds on the Frank-Wolfe algorithm (Frank and Wolfe 1956; Agrawal and Devanur 2014) and UCRL2 (Jaksch et al. 2010), together with a crucial and novel Gradient Threshold Procedure (GTP). GTP carefully delays gradient updates and returns a non-stationary policy that diversifies the outcomes to optimize the aggregate reward.
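To make the ingredients named in the abstract concrete, the sketch below illustrates (under strong simplifying assumptions, and not as the paper's actual algorithm) the Frank-Wolfe-style scalarization and the gradient-threshold idea on a stateless toy problem with three fixed policies. The names `aggregate`, `gradient`, `solve_scalarized`, and `GRAD_THRESHOLD` are hypothetical stand-ins; in particular, `solve_scalarized` replaces UCRL2's optimistic planning over an unknown MDP with a trivial argmax over known mean outcome vectors.

```python
import numpy as np

# Toy sketch, not the paper's algorithm: Frank-Wolfe-style scalarization with
# delayed gradient updates (the gradient-threshold idea) on a stateless problem.
# `aggregate`, `gradient`, `solve_scalarized`, GRAD_THRESHOLD are hypothetical.

D = 2                                  # number of outcome dimensions
POLICIES = np.array([[1.0, 0.0],       # mean outcome vector of each fixed policy
                     [0.0, 1.0],
                     [0.4, 0.4]])

def aggregate(v):
    # Concave aggregate reward over the average outcome vector.
    return np.sqrt(np.maximum(v, 0.0)).sum()

def gradient(v):
    # Gradient of the aggregate function (clipped away from zero).
    return 0.5 / np.sqrt(np.maximum(v, 1e-6))

def solve_scalarized(w):
    # Stand-in for UCRL2's optimistic planning: best policy for the
    # scalarized (gradient-weighted) reward.
    return int(np.argmax(POLICIES @ w))

rng = np.random.default_rng(0)
GRAD_THRESHOLD = 0.05                  # delay gradient updates until the
T = 5000                               # running average drifts this far

cum = np.zeros(D)                      # cumulative outcome vector
anchor = np.zeros(D)                   # average outcome at last gradient update
w = gradient(anchor)                   # current scalarization weights

for t in range(1, T + 1):
    pi = solve_scalarized(w)
    outcome = POLICIES[pi] + rng.normal(0.0, 0.05, size=D)  # noisy vector feedback
    cum += outcome
    avg = cum / t
    # Gradient-threshold idea: recompute the gradient only when the running
    # average has moved enough, yielding a piecewise-stationary policy.
    if np.linalg.norm(avg - anchor, ord=np.inf) > GRAD_THRESHOLD:
        w = gradient(avg)
        anchor = avg.copy()

print("average outcome:", np.round(cum / T, 3),
      "aggregate reward:", round(aggregate(cum / T), 3))
```

In this toy instance the best single policy achieves an aggregate reward of about 1.26, while alternating between the first two policies balances the average outcome near (0.5, 0.5) and reaches about 1.41, illustrating why non-stationary policies that diversify outcomes can beat any stationary one.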
