Abstract
Predictive coding theories suggest that the brain learns by predicting observations at various levels of abstraction. One of the most basic prediction tasks is view prediction: how would a given scene look from an alternative viewpoint? Humans excel at this task. Our ability to imagine and fill in missing visual information is tightly coupled with perception: we feel as if we see the world in three dimensions, while in fact only information from the front surface of the world hits our (2D) retinas. This paper explores the role of view-predictive representation learning in the development of 3D visual recognition. We propose inverse graphics networks, which take as input 2.5D video streams captured by a moving camera and map them to stable 3D feature maps of the scene by disentangling the scene content from the motion of the camera. The model can also project its 3D feature maps to novel viewpoints, to predict and match against target views. We propose contrastive prediction losses that can handle the stochasticity of the visual input and can scale view-predictive learning to more photorealistic scenes than those considered in previous works. We show that the proposed model learns 3D visual representations useful for (1) semi-supervised learning of 3D object detectors, and (2) unsupervised learning of 3D moving object detectors, by estimating the motion of the inferred 3D feature maps in videos of dynamic scenes. To the best of our knowledge, this is the first work that empirically shows view prediction to be a useful and scalable self-supervised task beneficial to 3D object detection.
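To make the idea of a contrastive prediction loss concrete, the following is a minimal sketch and not the formulation used in this work. The abstract does not specify the form of the loss, so an InfoNCE-style objective is assumed here. The function name `contrastive_view_loss`, the `temperature` parameter, and the flat `(N, D)` feature layout are all hypothetical choices made for illustration. Each predicted feature is scored against every target feature, and the loss rewards the matching pair over the others, so the predictor is not forced to regress one exact target.

```python
import numpy as np


def contrastive_view_loss(pred, target, temperature=0.1):
    """InfoNCE-style loss (an assumed form, for illustration only).

    pred, target: arrays of shape (N, D). Row i of `pred` is the feature
    predicted for a novel view; row i of `target` is the feature actually
    observed there. All other rows of `target` serve as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)

    # Similarity of every prediction to every target, scaled by temperature.
    logits = pred @ target.T / temperature

    # Numerically stable log-softmax over the candidate targets.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct match for row i is column i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(8, 16))
    matched = contrastive_view_loss(target, target)
    mismatched = contrastive_view_loss(target, np.roll(target, 1, axis=0))
    print(f"matched: {matched:.3f}  mismatched: {mismatched:.3f}")
```

Under these assumptions, correctly aligned predictions yield a lower loss than predictions paired with the wrong targets. That ordering is the property a contrastive objective optimises.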