Abstract
Diagrams often depict complex phenomena and serve as
a good test bed for visual and textual reasoning. However, understanding diagrams using natural image understanding approaches requires large training datasets of diagrams, which are very hard to obtain. Instead, this can
be addressed as a matching problem between labeled
diagrams, images, or both. This problem is very challenging
since the absence of significant color and texture renders local cues ambiguous and requires global reasoning. We consider the problem of one-shot part labeling: labeling multiple parts of an object in a target image given only a single
source image of that category. For this set-to-set matching problem, we introduce the Structured Set Matching Network (SSMN), a structured prediction model that incorporates convolutional neural networks. The SSMN is trained
using global normalization to maximize local match scores
between corresponding elements and a global consistency
score among all matched elements, while also enforcing a
matching constraint between the two sets. The SSMN significantly outperforms several strong baselines on three label transfer scenarios: diagram-to-diagram, evaluated on
a new diagram dataset of over 200 categories; image-to-image, evaluated on a dataset built on top of the Pascal Part
Dataset; and image-to-diagram, evaluated on transferring
labels across these datasets.
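
The objective described above (local match scores plus a global consistency score, under a one-to-one matching constraint) can be illustrated with a small toy sketch. This is not the SSMN itself: there are no convolutional networks and no training, the score values are made-up numbers, and the names `best_matching` and `order_consistency` are hypothetical. The sketch simply enumerates every one-to-one assignment and keeps the highest-scoring one, which is only feasible for very small sets.

```python
from itertools import permutations


def best_matching(local, consistency):
    """Exhaustively search one-to-one assignments (source i -> target perm[i]).

    local[i][j] is an assumed local match score between source part i and
    target part j; consistency(perm) is an assumed global score for the
    whole assignment. Returns (best_perm, best_score).
    """
    n = len(local)
    best_perm, best_score = None, float("-inf")
    # Iterating over permutations enforces the matching constraint:
    # every source part gets a distinct target part.
    for perm in permutations(range(n)):
        score = sum(local[i][perm[i]] for i in range(n)) + consistency(perm)
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


def order_consistency(perm):
    """Toy global term: reward pairs of parts whose relative order is preserved."""
    n = len(perm)
    return sum(1.0 for i in range(n) for j in range(i + 1, n) if perm[i] < perm[j])


# Hypothetical local scores with ambiguous local cues:
# source parts 0 and 1 both prefer target part 0.
local = [
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.2],
    [0.1, 0.2, 0.5],
]

best_perm, best_score = best_matching(local, order_consistency)
print(best_perm, best_score)
```

In this toy input, choosing the best target for each source part independently would assign two source parts to the same target. The assignments (0, 1, 2) and (1, 0, 2) have equal local score totals, and the global term is what separates them.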