Double Thompson Sampling for Dueling Bandits

资源分类

2020-02-05 |

44 |

41 |

Abstract

In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandit problems. As its name suggests, D-TS selects both the first and the second candidates according to Thompson Sampling. Specifically, D-TS maintains a posterior distribution for the preference matrix, and chooses the pair of arms for comparison according to two sets of samples independently drawn from the posterior distribution. This simple algorithm applies to general Copeland dueling bandits, including Condorcet dueling bandits as a special case. For general Copeland dueling bandits, we show that D-TS achieves regret. Moreover, using a back substitution argument, we refine the regret to in Condorcet dueling bandits and most practical Copeland dueling bandits. In addition, we propose an enhancement of D-TS, referred to as D-TS+ , to reduce the regret in practice by carefully breaking ties. Experiments based on both synthetic and real-world data demonstrate that D-TS and D-TS+ significantly improve the overall performance, in terms of regret and robustness.

上一篇：Quantum Perceptron Models

下一篇：GAP Safe Screening Rules for Sparse-Group Lasso

用户评价

全部评价

还没有评论，说两句吧！

热门资源

Learning to Predi...

Much of model-based reinforcement learning invo...
Stratified Strate...

In this paper we introduce Stratified Strategy ...
The Variational S...

Unlike traditional images which do not offer in...
A Mathematical Mo...

Direct democracy, where each voter casts one vo...
Rating-Boosted La...

The performance of a recommendation system reli...

智能在线

400-630-6780
聆听.建议反馈

E-mail: support@tusaishared.com