Abstract
Zero-shot learning (ZSL) models rely on learning a joint embedding space into which both textual/semantic descriptions of object classes and visual representations of object images can be projected for nearest neighbour search. Despite the success of deep neural networks that learn an end-to-end model between text and images in other vision problems such as image captioning, very few deep ZSL models exist, and they show little advantage over ZSL models that utilise deep feature representations but do not learn an end-to-end embedding. In this paper we argue that the key to making deep ZSL models succeed is choosing the right embedding space. Instead of embedding into a semantic space or an
intermediate space, we propose to use the visual space as
the embedding space. This is because in this space, the subsequent nearest neighbour search suffers much less from the hubness problem (whereby a few points become the nearest neighbours of a disproportionately large number of queries) and thus becomes more effective. This model design also provides a natural mechanism
for multiple semantic modalities (e.g., attributes and sentence descriptions) to be fused and optimised jointly in an
end-to-end manner. Extensive experiments on four benchmarks show that our model significantly outperforms the
existing models.
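To make the embedding direction concrete, the following is a minimal NumPy sketch of the nearest-neighbour step in visual space: class semantic vectors are mapped into the visual feature space and a test image is assigned to the class with the closest projected prototype. The linear projection `W`, the dimensions, and all names here are illustrative assumptions, not the paper's actual end-to-end deep model.

```python
import numpy as np

# Hypothetical dimensions: semantic (attribute) vectors and CNN visual features.
SEM_DIM, VIS_DIM = 85, 2048

rng = np.random.default_rng(0)

# W stands in for the learned semantic-to-visual projection; in the paper this
# mapping is trained end-to-end, here it is random purely for illustration.
W = rng.normal(size=(SEM_DIM, VIS_DIM))

def embed_classes(class_attributes: np.ndarray) -> np.ndarray:
    """Project per-class semantic vectors into the visual feature space."""
    return class_attributes @ W

def classify(image_feature: np.ndarray, class_prototypes: np.ndarray) -> int:
    """Nearest-neighbour search in visual space: return the index of the
    class whose projected prototype is closest to the image feature."""
    dists = np.linalg.norm(class_prototypes - image_feature, axis=1)
    return int(np.argmin(dists))

# Toy usage: 10 unseen classes, one test image.
attrs = rng.normal(size=(10, SEM_DIM))   # semantic descriptions of unseen classes
prototypes = embed_classes(attrs)        # class prototypes in visual space
img = rng.normal(size=VIS_DIM)           # CNN feature of a test image
print("predicted class:", classify(img, prototypes))
```

Because the search is carried out among class prototypes living in the visual feature space rather than among image features projected into a semantic space, the abstract's claim is that hub formation among the candidate points is reduced.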