Computing Approximate Equilibria in Sequential
Adversarial Games by Exploitability Descent
Abstract
In this paper, we present exploitability descent, a
new algorithm to compute approximate equilibria
in two-player zero-sum extensive-form games with
imperfect information, by direct policy optimization against worst-case opponents. We prove that
when following this optimization, the exploitability of a player’s strategy converges asymptotically
to zero, and hence when both players employ this
optimization, the joint policies converge to a Nash
equilibrium. Unlike fictitious play (XFP) and counterfactual regret minimization (CFR), our convergence result pertains to the policies being optimized
rather than the average policies. Our experiments
demonstrate convergence rates comparable to XFP
and CFR in four benchmark games in the tabular
case. Using function approximation, we find that
our algorithm outperforms the tabular version in two
of the games, which, to the best of our knowledge, is
the first such result in imperfect information games
among this class of algorithms.