Link the head to the “beak”: Zero Shot Learning
from Noisy Text Description at Part Precision
Abstract
In this paper, we study learning visual classifiers from unstructured text descriptions at part precision with no training
images. We propose a learning framework that is able to connect text terms to their relevant parts and suppress connections to non-visual text terms without any part-text annotations.
For instance, this learning process enables terms like “beak”
to be sparsely linked to the visual representation of parts
like the head, while reducing the effect of non-visual terms like
“migrate” on classifier prediction. Images are encoded by a
part-based CNN that detects bird parts and learns part-specific
representations. Part-based visual classifiers are predicted
from the text descriptions of unseen classes to facilitate classification without training images (also known as
zero-shot recognition). We performed our experiments on
the CUBirds 2011 dataset and improved the state-of-the-art text-based zero-shot recognition results from 34.7% to 43.6%.
We also created large-scale benchmarks on North American
Bird Images augmented with text descriptions, on which we
show that our approach outperforms existing methods. Our
code, data, and models are publicly available [1].