Abstract
In self-supervised learning, one trains a model to solve a
so-called pretext task on a dataset without the need for human annotation. The main objective, however, is to transfer
this model to a target domain and task. Currently, the most
effective transfer strategy is fine-tuning, which restricts one
to using the same model or parts thereof for both pretext and
target tasks. In this paper, we present a novel framework
for self-supervised learning that overcomes limitations in
designing and comparing different tasks, models, and data
domains. In particular, our framework decouples the structure of the self-supervised model from the final task-specific
fine-tuned model. This allows us to: 1) quantitatively assess previously incompatible models, including handcrafted
features; 2) show that deeper neural network models can
learn better representations from the same pretext task; 3)
transfer knowledge learned with a deep model to a shallower one and thus boost its learning. We use this framework to design a novel self-supervised task, which achieves state-of-the-art performance by a significant margin on the common benchmarks: PASCAL VOC 2007, ILSVRC12, and Places. Our learned features shrink the mAP gap in object detection on PASCAL VOC 2007 between models trained via self-supervised learning and those trained via supervised learning from 5.9% to 2.6%.