Abstract
Scaling up visual category recognition to large numbers of classes remains challenging. A promising research direction is zero-shot learning, which does not require any training data to recognize new classes, but rather relies on some form of auxiliary information describing the new classes. Ultimately, this may allow the use of textbook knowledge that humans employ to learn about new classes by transferring knowledge from classes they know well. The most successful zero-shot learning approaches currently require a particular type of auxiliary information – namely attribute annotations performed by humans – that is not readily available for most classes. Our goal is to circumvent this bottleneck by replacing such annotations with multiple pieces of information extracted from multiple unstructured text sources readily available on the web. To compensate for this weaker form of auxiliary information, we incorporate stronger supervision in the form of semantic part annotations on the classes from which we transfer knowledge. We achieve our goal with a joint embedding framework that maps multiple text parts as well as multiple semantic parts into a common space. Our results consistently and significantly improve on the state of the art in zero-shot recognition and retrieval.