Abstract
In this paper, we present an end-to-end, multi-level fusion based framework for 3D object detection from a single monocular image. The network is composed of
two parts: one for 2D region proposal generation and another for the simultaneous prediction of objects' 2D locations, orientations, dimensions, and 3D locations. With the
help of a stand-alone module to estimate the disparity and
compute the 3D point cloud, we introduce the multi-level
fusion scheme. First, we encode the disparity information
with a front view feature representation and fuse it with the
RGB image to enhance the input. Second, features extracted
from the original input and the point cloud are combined
to boost object detection. For 3D localization, we introduce an extra stream that predicts location information
directly from the point cloud and add it to the aforementioned
location prediction. The proposed algorithm can directly
output both 2D and 3D object detection results in an end-to-end fashion with only a single RGB image as the input.
The experimental results on the challenging KITTI benchmark demonstrate that our algorithm significantly outperforms state-of-the-art monocular methods.
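To make the fusion scheme concrete, the following is a minimal PyTorch-style sketch of the three fusion levels described above: input-level fusion of the RGB image with a front-view disparity encoding, feature-level fusion with point-cloud features, and additive fusion of the point-cloud location prediction with the image-based one. All module names, channel sizes, and layer choices here are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn


class MultiLevelFusionSketch(nn.Module):
    """Illustrative sketch of the multi-level fusion idea (assumed layers/sizes):
    (1) input-level fusion of RGB with a front-view disparity encoding,
    (2) feature-level fusion of image and point-cloud streams,
    (3) additive fusion of a point-cloud location prediction with the
        image-based location prediction."""

    def __init__(self, feat_ch=64):
        super().__init__()
        # (1) Input-level fusion: RGB (3 channels) + disparity encoding (1 channel)
        self.input_fusion = nn.Conv2d(3 + 1, feat_ch, kernel_size=3, padding=1)
        # Separate feature extractor for the XYZ point-cloud map
        self.pc_stream = nn.Conv2d(3, feat_ch, kernel_size=3, padding=1)
        # Location heads on globally pooled features
        self.loc_head_img = nn.Linear(2 * feat_ch, 3)  # from fused image/point-cloud features
        self.loc_head_pc = nn.Linear(feat_ch, 3)       # from point-cloud features only

    def forward(self, rgb, disparity, xyz_map):
        # (1) fuse the disparity encoding with the RGB image at the input level
        x = self.input_fusion(torch.cat([rgb, disparity], dim=1))
        # extract features from the point-cloud (XYZ) map
        p = self.pc_stream(xyz_map)
        # (2) feature-level fusion of the two streams
        fused_vec = torch.cat([x, p], dim=1).mean(dim=(2, 3))
        pc_vec = p.mean(dim=(2, 3))
        # (3) add the point-cloud location prediction to the image-based one
        return self.loc_head_img(fused_vec) + self.loc_head_pc(pc_vec)


if __name__ == "__main__":
    net = MultiLevelFusionSketch()
    rgb = torch.randn(1, 3, 64, 64)
    disparity = torch.randn(1, 1, 64, 64)
    xyz_map = torch.randn(1, 3, 64, 64)
    print(net(rgb, disparity, xyz_map).shape)  # torch.Size([1, 3])
```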