Abstract
Monocular 3D object parsing is highly desirable in various scenarios including occlusion reasoning and holistic
scene interpretation. We present a deep convolutional neural network (CNN) architecture to localize semantic parts
in the 2D image and in 3D space while inferring their visibility
states, given a single RGB image. Our key insight is to exploit domain knowledge to regularize the network by deeply
supervising its hidden layers, in order to sequentially infer
intermediate concepts associated with the final task. To acquire training data in desired quantities with ground truth
3D shape and relevant concepts, we render 3D object CAD
models to generate large-scale synthetic data and simulate
challenging occlusion configurations between objects. We
train the network only on synthetic data and demonstrate
state-of-the-art performance on real-image benchmarks
including an extended version of KITTI, PASCAL VOC, PASCAL3D+ and IKEA for 2D and 3D keypoint localization and
instance segmentation. The empirical results substantiate
the utility of our deep supervision scheme by demonstrating
effective transfer of knowledge from synthetic data to real
images, resulting in less overfitting compared to standard
end-to-end training.