Fast and Furious Learning in Zero-Sum Games: Vanishing Regret with Non-Vanishing Step Sizes

资源分类

2020-02-19 |

91 |

67 |

Abstract

We show for the first time that it is possible to reconcile in online learning in zero-sum games two seemingly contradictory objectives: vanishing time-average regret and non-vanishing step sizes. This phenomenon, that we coin “fast and furious" learning in games, sets a new benchmark about what is possible both in max-min optimization as well as in multi-agent systems. Our analysis does not depend on introducing a carefully tailored dynamic. Instead we focus on the most well studied online dynamic, gradient descent. Similarly, we focus on the simplest textbook class of games, two-agent two-strategy zero-sum games, such as Matching Pennies. Even for this simplest of benchmarks the best known bound for total regret, prior to our work, was the trivial one of O(T ), which is immediately applicable even to a non-learning agent. Based on a tight understanding of the geometry of the non-equilibrating trajectories in the dual space we prove a regret bound of 图片.png matching the well known optimal bound for adaptive step sizes in the online setting. This guarantee holds for all fixed step-sizes without having to know the time horizon in advance and adapt the fixed step-size accordingly.As a corollary, we establish that even with fixed learning rates the time-average of mixed strategies, utilities converge to their exact Nash equilibrium values. We also provide experimental evidence suggesting the stronger regret bound holds for all zero-sum games.Tails, Heads Heads, Heads Regret vs Iteration Heads, Heads Regret2 vs Iteration 50 2500 40 2000 30 1500 20 1000 10 500Tails, Tails Heads, Tails 1000 2000 3000 4000 5000 1000 2000 3000 4000 5000 (a) Player Strategies (b) Player 1 Regret (c) Player 1 Regret Squared Figure 1: 5000 Iterations of Gradient Descent on Matching Pennies with ? = .15. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

上一篇：Defense Against Adversarial Attacks Using Feature Scattering-based Adversarial Training

下一篇：Variance Reduced Policy Evaluation with Smooth Function Approximation

用户评价

全部评价

还没有评论，说两句吧！

热门资源

A Mathematical Mo...

Direct democracy, where each voter casts one vo...
Learning to Predi...

Much of model-based reinforcement learning invo...
The Variational S...

Unlike traditional images which do not offer in...
Hierarchical Task...

We extend hierarchical task network planning wi...
Shape-based Autom...

We present an algorithm for automatic detection...

智能在线

400-630-6780
聆听.建议反馈

E-mail: support@tusaishared.com