Abstract
Object detection and 6D pose estimation in the crowd (scenes with multiple object instances, severe foreground occlusions and background distractors) has become an important problem in many rapidly evolving technological areas such as robotics and augmented reality. Single shot-based 6D pose estimators with manually designed features are still unable to tackle the above challenges, motivating the research towards unsupervised feature learning and next-best-view estimation. In this work, we present a complete framework for both single shot-based 6D object pose estimation and next-best-view prediction based on Hough Forests, the state of the art object pose estimator that performs classification and regression jointly. Rather than using manually designed features, we a) propose an unsupervised feature learnt from depth-invariant patches using a Sparse Autoencoder and b) offer an extensive evaluation of various state of the art features. Furthermore, taking advantage of the clustering performed in the leaf nodes of Hough Forests, we learn to estimate the reduction of uncertainty in other views, formulating the problem of selecting the next-best-view. To further improve pose estimation, we propose an improved joint registration and hypotheses verification module as a final refinement step to reject false detections. We provide two additional challenging datasets inspired by realistic scenarios to extensively evaluate the state of the art and our framework. One is related to domestic environments and the other depicts a bin-picking scenario mostly found in industrial settings. We show that our framework significantly outperforms the state of the art both on public datasets and on our own.