Abstract
Many structured prediction tasks in machine vision have
a collection of acceptable answers, instead of one definitive
ground truth answer. Segmentation of images, for example,
is subject to human labeling bias. Similarly, there are multiple possible pixel values that could plausibly complete occluded image regions. State-of-the-art supervised learning
methods are typically optimized to make a single test-time
prediction for each query, failing to find other modes in the
output space. Existing methods that allow for sampling often sacrifice speed or accuracy.
We introduce a simple method for training a neural network, which enables diverse structured predictions to be
made for each test-time query. For a single input, we learn
to predict a range of possible answers. We compare favorably to methods that seek diversity through an ensemble of
networks. Such stochastic multiple choice learning faces
mode collapse, where one or more ensemble members fail
to receive any training signal. Our best-performing solution can be deployed for various tasks, and involves only
small modifications to the existing single-mode architecture, loss function, and training regime. We demonstrate
that our method results in quantitative improvements across
three challenging tasks: 2D image completion, 3D volume
estimation, and flow prediction.
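
The mode collapse attributed above to stochastic multiple choice learning can be illustrated with a small sketch of winner-takes-all assignment, in which only the ensemble member closest to the target is updated for each training example. This is a toy illustration of that ensemble baseline, not the method introduced in this paper. The scalar outputs, the squared-error criterion, and the names `winner_takes_all` and `count_updates` are assumptions made for the example.

```python
def winner_takes_all(predictions, target):
    """Return (index of the closest member, its squared error).

    predictions: list of hypothetical scalar outputs, one per ensemble member.
    target: one acceptable answer for the current training example.
    Under winner-takes-all training, only the returned member would be updated.
    """
    errors = [(p - target) ** 2 for p in predictions]
    winner = min(range(len(errors)), key=errors.__getitem__)
    return winner, errors[winner]


def count_updates(predictions, targets):
    """Count how many training examples each member wins."""
    wins = [0] * len(predictions)
    for t in targets:
        winner, _ = winner_takes_all(predictions, t)
        wins[winner] += 1
    return wins


# Three members; the third is initialized far from both modes of the targets.
members = [-1.0, 1.0, 10.0]
targets = [-1.2, -0.9, 0.8, 1.1, 1.3]
wins = count_updates(members, targets)
print(wins)  # [2, 3, 0]: the third member never wins
```

In this toy setting the third member never wins an example, so it would receive no training signal and could not move toward either mode.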