Abstract
Detecting objects, estimating their pose, and recovering 3D shape information are critical problems in many vision and robotics applications. This paper addresses these needs by proposing a new method called DEHV, a Depth-Encoded Hough Voting detection scheme. Inspired by the Hough voting scheme introduced in [13], DEHV incorporates depth information into the process of learning distributions of image features (patches) that represent an object category. DEHV takes advantage of the interplay between the scale of each object patch in the image and the distance (depth) of the corresponding physical patch on the 3D object. DEHV jointly detects objects, infers their categories, estimates their pose, and infers/decodes object depth maps from either a single image (when no depth maps are available at test time) or a single image augmented with a depth map (when one is available at test time). Extensive quantitative and qualitative experimental analysis on existing datasets [6,9,22] and on a newly proposed 3D table-top object category dataset shows that our DEHV scheme obtains competitive detection and pose estimation results, as well as convincing 3D shape reconstruction from just a single uncalibrated image. Finally, we demonstrate that our technique can be successfully employed as a key building block in two application scenarios: highly accurate 6-degrees-of-freedom (6 DOF) pose estimation and 3D object modeling.
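The scale–depth interplay mentioned above follows from the standard pinhole camera model: the apparent image size of a patch is inversely proportional to its depth, so an observed patch scale can be inverted to decode depth once the physical patch size has been learned. The sketch below is a minimal illustration of this geometric relation only, not the authors' implementation; the focal length and patch sizes are hypothetical values chosen for the example.

```python
# Pinhole-camera relation behind the scale-depth interplay:
# a physical patch of size S (meters) at depth Z (meters) projects to
# s = f * S / Z pixels, so the observed image scale s of a patch whose
# physical size S is known (e.g. learned at training time) yields
# Z = f * S / s. All numeric values are hypothetical.

def projected_scale(f: float, physical_size: float, depth: float) -> float:
    """Image-plane size (pixels) of a patch with the given physical size at the given depth."""
    return f * physical_size / depth

def decoded_depth(f: float, physical_size: float, image_scale: float) -> float:
    """Invert the projection to recover depth from the observed patch scale."""
    return f * physical_size / image_scale

f = 500.0   # focal length in pixels (hypothetical)
S = 0.10    # physical patch size in meters (hypothetical)
Z = 2.0     # true depth in meters

s = projected_scale(f, S, Z)    # observed patch scale in pixels
Z_hat = decoded_depth(f, S, s)  # depth recovered from that scale
print(s, Z_hat)
```

Note that halving the depth doubles the observed patch scale, which is why patch scale carries a usable depth signal at all.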