Abstract
In the problem of learning with label proportions
(also known as the problem of estimating class ratios), the training data is unlabeled, and only the
proportions of examples receiving each label are
given. The goal is to learn a hypothesis that predicts
the proportions of labels on the distribution underlying the sample. This model of learning is useful
in a wide variety of settings, including predicting
the number of votes for candidates in political elections from polls.
In this paper, we resolve foundational questions regarding the computational complexity of learning
in this setting. We formalize a simple version of the
setting, and we compare the computational complexity of learning in this model to classical PAC
learning. Perhaps surprisingly, we show that what
can be learned efficiently in this model is a strict
subset of what may be learned efficiently in PAC,
under standard complexity assumptions. We give
a characterization in terms of VC dimension, and
we show that there are non-trivial problems in this
model that can be efficiently learned. We also give
an algorithm that demonstrates the feasibility of
learning under well-behaved distributions.