Abstract
Most existing zero-shot learning approaches exploit transfer learning via an intermediate-level semantic representation such as visual attributes or semantic word vectors. Such a semantic representation is shared between an annotated auxiliary dataset and a target dataset with no annotation. A projection from a low-level feature space to the semantic space is learned from the auxiliary dataset and is applied without adaptation to the target dataset. In this paper we identify an inherent limitation of this approach: because the auxiliary and target datasets have disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. It is ‘transductive’ in that unlabelled target data points are explored for projection adaptation, and ‘multi-view’ in that both the low-level feature (view) and multiple semantic representations (views) are embedded to rectify the projection shift. We demonstrate through extensive experiments that our framework (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) achieves state-of-the-art recognition results on image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.
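For readers unfamiliar with the pipeline the abstract critiques, the following is a minimal sketch (not this paper's method) of the standard zero-shot setup: a feature-to-attribute projection is fitted on labelled auxiliary classes and then applied, without any adaptation, to unlabelled target classes. The toy data, ridge-regression projection, and all names (`sample`, `proj`, etc.) are illustrative assumptions, not from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d_feat, d_attr = 20, 5
W_true = rng.standard_normal((d_attr, d_feat))  # latent attribute-to-feature map

def sample(prototypes, n_per_class=50):
    """Draw low-level features whose structure follows the class attributes."""
    X, y = [], []
    for c, a in enumerate(prototypes):
        X.append(a @ W_true + 0.1 * rng.standard_normal((n_per_class, d_feat)))
        y.extend([c] * n_per_class)
    return np.vstack(X), np.array(y)

# Disjoint class sets sharing one semantic (attribute) space.
aux_attrs = rng.random((3, d_attr))  # prototypes of 3 labelled auxiliary classes
tgt_attrs = rng.random((2, d_attr))  # prototypes of 2 unseen target classes
X_aux, y_aux = sample(aux_attrs)
X_tgt, y_tgt = sample(tgt_attrs)

# Projection learned from the auxiliary dataset only ...
proj = Ridge(alpha=1.0).fit(X_aux, aux_attrs[y_aux])
# ... and applied without adaptation to the target dataset; any bias in
# this projection is the "projection domain shift" the paper addresses.
A_pred = proj.predict(X_tgt)

# Zero-shot classification: nearest attribute prototype among unseen classes.
dists = np.linalg.norm(A_pred[:, None, :] - tgt_attrs[None, :, :], axis=2)
print("target accuracy:", (dists.argmin(axis=1) == y_tgt).mean())
```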