Abstract
Latent-state environments with long horizons, such as those faced by recommender systems, pose significant challenges for reinforcement learning (RL). We identify and analyze several key hurdles for RL in such environments, including belief state error and small action advantage. We develop a general principle, called advantage amplification, that can overcome these hurdles through the use of temporal abstraction. We propose several aggregation methods and prove that they induce amplification in certain settings. We also bound the loss in optimality incurred by our methods in environments where the latent state evolves slowly, and we demonstrate their performance empirically in a stylized user-modeling task.