Abstract
We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in
countable classes of general stochastic environments. These environments can be non-Markovian,
non-ergodic, and partially observable. We show
that Thompson sampling learns the environment
class in the sense that (1) asymptotically its value
converges in mean to the optimal value, and (2) given a recoverability assumption, its regret is sublinear. We conclude with a discussion about optimality in reinforcement learning.
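
Stated more formally, and using notation that is common in the general reinforcement learning literature but not fixed by the abstract itself (the true environment $\mu$, the Thompson sampling policy $\pi_T$, the value functions $V$, the history $h_{<t}$, and the regret $R_m$ after $m$ steps are assumptions here), the two claims can be sketched as:

% A sketch of the two results, in assumed notation:
% mu is the true environment, pi_T the Thompson sampling policy,
% V the value function, h_{<t} the history, R_m the regret after m steps.
\begin{align*}
  \text{(1)}\quad & \lim_{t \to \infty} \mathbb{E}^{\pi_T}_{\mu}\!\left[ V^{*}_{\mu}(h_{<t}) - V^{\pi_T}_{\mu}(h_{<t}) \right] = 0
      && \text{(asymptotic optimality in mean)} \\
  \text{(2)}\quad & R_m(\pi_T, \mu) \in o(m)
      && \text{(sublinear regret, given recoverability)}
\end{align*}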