Abstract
Interactive image segmentation is characterized by multimodality. When the user clicks on a door, do they intend to
select the door or the whole house? We present an end-toend learning approach to interactive image segmentation
that tackles this ambiguity. Our architecture couples two
convolutional networks. The first is trained to synthesize a
diverse set of plausible segmentations that conform to the
user’s input. The second is trained to select among these.
By selecting a single solution, our approach retains compatibility with existing interactive segmentation interfaces.
By synthesizing multiple diverse solutions before selecting
one, the architecture is given the representational power to
explore the multimodal solution space. We show that the
proposed approach outperforms existing methods for interactive image segmentation, including prior work that applied convolutional networks to this problem, while being
much faster