Abstract
A key technical challenge in performing 6D object pose
estimation from RGB-D images is to fully leverage the two
complementary data sources. Prior works either extract information from the RGB image and depth separately or use
costly post-processing steps, limiting their performance in
highly cluttered scenes and real-time applications. In this
work, we present DenseFusion, a generic framework for
estimating the 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture
that processes the two data sources individually and uses a
novel dense fusion network to extract a pixel-wise dense feature embedding, from which the pose is estimated. In addition, we integrate an end-to-end iterative pose refinement
procedure that further improves the pose estimate while
achieving near real-time inference. Our experiments show
that our method outperforms state-of-the-art approaches on
two datasets, YCB-Video and LineMOD. We also deploy our
proposed method on a real robot to grasp and manipulate
objects based on the estimated pose. Our code and video
are available at https://sites.google.com/view/densefusion/.
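
As a rough illustration of what pixel-wise dense fusion could look like, the sketch below pairs each pixel's color embedding with the embedding of its corresponding depth point and appends a shared global context vector. This is a minimal sketch under stated assumptions, not the DenseFusion implementation: the function name fuse_pixelwise, the use of plain concatenation, and the mean-pooled global vector are all hypothetical choices made for illustration, and the per-source feature extractors are assumed to have already run.

```python
from typing import List

Vector = List[float]


def fuse_pixelwise(color_feats: List[Vector], geom_feats: List[Vector]) -> List[Vector]:
    """Hypothetical fusion step: one fused feature per pixel.

    color_feats[i] and geom_feats[i] are assumed to describe the same pixel
    (its RGB embedding and the embedding of its depth point).
    """
    if len(color_feats) != len(geom_feats) or not color_feats:
        raise ValueError("need the same non-zero number of color and geometry features")
    # Concatenate the two embeddings for each pixel.
    local = [c + g for c, g in zip(color_feats, geom_feats)]
    dim = len(local[0])
    # Mean-pool over pixels to form a global context vector (assumed choice).
    global_feat = [sum(f[i] for f in local) / len(local) for i in range(dim)]
    # Every pixel keeps its local feature and gains the shared global context.
    return [f + global_feat for f in local]


# Two pixels, 2-D color embeddings, 1-D geometry embeddings.
color = [[1.0, 0.0], [3.0, 2.0]]
geom = [[0.5], [1.5]]
fused = fuse_pixelwise(color, geom)
```

In a design of this kind, each fused per-pixel feature could then feed its own pose prediction, so that occluded or noisy pixels need not dominate the final estimate.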