Abstract
This work addresses the problem of estimating the 6D pose of specific objects from a single RGB-D image. We present a flexible approach that can deal with generic objects, both textured and texture-less. The key new concept is a learned, intermediate representation in the form of a dense 3D object coordinate labelling paired with a dense class labelling. We show that on a common dataset of texture-less objects, where template-based techniques are suitable and state of the art, our approach is slightly superior in terms of accuracy. We also demonstrate the benefits of our approach over template-based techniques in terms of robustness to varying lighting conditions. To this end, we contribute a new ground truth dataset with 10k images of 20 objects, each captured under three different lighting conditions. We demonstrate that our approach scales well with the number of objects and has the capability to run fast.