Abstract
Labels generated by human experts via comparisons exhibit less variance than traditional sample labels. However, collecting comparison labels is challenging over large datasets, as the number of possible comparisons grows quadratically with the dataset size. We study the following experimental design problem: given a budget of expert comparisons and a set of existing sample labels, determine which comparison labels to collect so as to maximally improve classification. We consider several experimental design objectives motivated by the Bradley–Terry model. The resulting optimization problems amount to maximizing submodular functions. We experimentally evaluate the performance of these methods on synthetic and real-life datasets.
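For context, a standard parameterization of the Bradley–Terry model (shown here as a sketch; the exact variant used in the paper may differ) assigns each item $i$ a latent score $s_i$ and models the probability that an expert prefers item $i$ over item $j$ as
\[
  \Pr(i \succ j) \;=\; \frac{e^{s_i}}{e^{s_i} + e^{s_j}} \;=\; \sigma(s_i - s_j),
  \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
\]
With $n$ items there are $\binom{n}{2} = O(n^2)$ candidate pairs, which is why comparison collection scales quadratically with dataset size and motivates selecting comparisons under a budget.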