Abstract
We consider an agent who interacts with an online Markov Decision Process and receives a vector of outcomes in every round. The agent aims to maximize an aggregate reward function of the multi-dimensional outcomes. Owing to state transitions, it is challenging to balance the contributions of the individual dimensions to achieve near-optimality. In contrast to the single-objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm that combines the Frank-Wolfe algorithm (Frank and Wolfe 1956; Agrawal and Devanur 2014) and UCRL2 (Jaksch et al. 2010) with a crucial and novel Gradient Threshold Procedure (GTP). GTP carefully delays gradient updates and returns a non-stationary policy that diversifies the outcomes to optimize the aggregate reward.
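To give a flavor of the approach, the following is a minimal, purely illustrative Python sketch of a Frank-Wolfe-style outer loop with a gradient-threshold rule for delaying direction updates. The names `estimate_linear_mdp_policy` (a stand-in for a UCRL2-style oracle that near-maximizes a scalarized reward), `f_grad`, and the specific drift trigger are hypothetical assumptions for illustration, not the paper's actual GTP.

```python
import numpy as np

def fw_gradient_threshold(estimate_linear_mdp_policy, f_grad, T, d, threshold=0.1):
    """Illustrative Frank-Wolfe loop with a gradient-threshold rule (a sketch,
    not the paper's algorithm).

    estimate_linear_mdp_policy(w): hypothetical oracle that, for a direction
        w in R^d, returns a callable policy (near-)maximizing the scalarized
        reward <w, outcome>, e.g., computed by a UCRL2-style subroutine.
    f_grad(x): gradient of the concave aggregate reward f at x in R^d.
    T: number of rounds; d: outcome dimension; threshold: assumed trigger
        for recomputing the gradient direction (the "delayed update" rule).
    """
    x_bar = np.zeros(d)                      # running average of observed outcomes
    w = f_grad(x_bar)                        # current gradient direction
    policy = estimate_linear_mdp_policy(w)   # policy for the current direction
    for t in range(1, T + 1):
        outcome = policy()                   # one round's outcome vector in R^d
        x_bar += (outcome - x_bar) / t       # update the running average
        g = f_grad(x_bar)
        # Gradient-threshold rule: switch policies only once the gradient has
        # drifted far enough; switching every round would disrupt the state
        # trajectory that the non-stationary policy relies on.
        if np.linalg.norm(g - w) > threshold:
            w = g
            policy = estimate_linear_mdp_policy(w)
    return x_bar
```

The resulting play is non-stationary overall (the followed policy changes over time), yet each linear-scalarization policy is run long enough between updates for its long-run average outcomes to materialize, which is the intuition behind delaying the gradient updates.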