Abstract
Q-learning in single-agent environments is known to converge in the limit given sufficient exploration. The same algorithm has been applied, with some success, in multiagent environments, where traditional analysis techniques break down. Using established dynamical systems methods, we derive and study an idealization of Q-learning in 2-player 2-action repeated general-sum games. In particular, we address the discontinuous case of -greedy exploration and use it as a proxy for value-based algorithms to highlight a contrast with existing results in policy search. Analogously to previous results for gradient ascent algorithms, we provide a complete catalog of the convergence behavior of the -greedy Q-learning algorithm by introducing new subclasses of these games. We identify two subclasses of Prisoner’s Dilemma-like games where the application of Q-learning with -greedy exploration results in higher-than-Nash average payoffs for some initial conditions.