Abstract. It is expensive to label images with 3D structure or precise
camera pose. Yet, this is precisely the kind of annotation required to train
single-view 3D reconstruction models. In contrast, unlabeled images or
images with just category labels are easy to acquire, but few current
models can use this weak supervision. We present a unified framework
that can combine both types of supervision: a small number of camera
pose annotations are used to enforce pose invariance and viewpoint
consistency, and unlabeled images combined with an adversarial loss are
used to enforce the realism of rendered, generated models. We use this
unified framework to measure the impact of each form of supervision in
three paradigms: semi-supervised, multi-task, and transfer learning. We
show that with a combination of these ideas, we can train single-view
reconstruction models that improve by up to 7 points in performance (AP)
when using only 1% pose-annotated training data.