Abstract. One core challenge in object pose estimation is to ensure accurate
and robust performance for large numbers of diverse foreground objects amidst
complex background clutter. In this work, we present a scalable framework for
accurately inferring the six degree-of-freedom (6-DoF) pose of a large number of
object classes from single or multiple views. To learn discriminative pose features,
we integrate three new capabilities into a deep Convolutional Neural Network
(CNN): an inference scheme that combines classification and pose regression
based on a uniform tessellation of the Special Euclidean group in three dimensions
(SE(3)), the fusion of class priors into the training process via a tiled class map,
and an additional regularization using deep supervision with an object mask.
Further, an efficient multi-view framework is formulated to address single-view
ambiguity. We show that this framework consistently improves the performance of
the single-view network. We evaluate our method on three large-scale benchmarks:
YCB-Video, JHUScene-50 and ObjectNet-3D. Our approach achieves performance
competitive with or superior to current state-of-the-art methods.
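
As an illustration of the tiled class map mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes that the class prior is a one-hot class vector repeated at every spatial location and concatenated with a feature map along the channel axis; the abstract does not specify this. The function names `tiled_class_map` and `concat_channels`, and the channel-first nested-list layout, are hypothetical choices made for the example.

```python
from typing import List

# A channel-first map: map[channel][row][col]
Map3D = List[List[List[float]]]


def tiled_class_map(class_id: int, num_classes: int,
                    height: int, width: int) -> Map3D:
    """Build a (num_classes, height, width) map that is one-hot over
    channels and constant over all spatial locations (assumed layout)."""
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id must be in [0, num_classes)")
    return [
        [[1.0 if c == class_id else 0.0] * width for _ in range(height)]
        for c in range(num_classes)
    ]


def concat_channels(features: Map3D, class_map: Map3D) -> Map3D:
    """Concatenate two channel-first maps along the channel axis,
    after checking that their spatial sizes agree."""
    f_size = (len(features[0]), len(features[0][0]))
    c_size = (len(class_map[0]), len(class_map[0][0]))
    if f_size != c_size:
        raise ValueError("spatial sizes differ: %s vs %s" % (f_size, c_size))
    return features + class_map


# Usage: fuse a hypothetical 2-channel, 3x4 feature map with a prior for
# class 1 out of 5 classes.
features = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
fused = concat_channels(features, tiled_class_map(1, 5, 3, 4))
```

Under these assumptions, `fused` has 2 + 5 channels, and the channel belonging to the given class is filled with ones, so every spatial location of the downstream layers sees the same class prior.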