A Lazy Man’s Approach to Benchmarking:
Semisupervised Classifier Evaluation and Recalibration
Abstract
How many labeled examples are needed to estimate a classifier’s performance on a new dataset? We study the case where data is plentiful, but labels are expensive. We show that by making a few reasonable assumptions on the structure of the data, it is possible to estimate performance curves, with confidence bounds, using a small number of ground truth labels. Our approach, which we call Semisupervised Performance Evaluation (SPE), is based on a generative model for the classifier’s confidence scores. In addition to estimating the performance of classifiers on new datasets, SPE can be used to recalibrate a classifier by re-estimating the class-conditional confidence distributions.