Abstract
We present PointFusion, a generic 3D object detection
method that leverages both image and 3D point cloud information. Unlike existing methods that either use multi-stage pipelines or hold sensor- and dataset-specific assumptions, PointFusion is conceptually simple and application-agnostic. The image data and the raw point cloud data are
independently processed by a CNN and a PointNet architecture, respectively. The resulting outputs are then combined by a novel fusion network, which predicts multiple
3D box hypotheses and their confidences, using the input
3D points as spatial anchors. We evaluate PointFusion on
two distinct datasets: the KITTI dataset, which features
driving scenes captured with a lidar-camera setup, and the
SUN-RGBD dataset, which captures indoor environments with
RGB-D cameras. Our model is the first to perform better
than or on par with the state of the art on these
diverse datasets without any dataset-specific model tuning.
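
To make the idea of using input 3D points as spatial anchors concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each input point carries one box hypothesis encoded as eight corner offsets relative to that point, plus one confidence score, and that the final box is the hypothesis with the highest confidence. The function name `select_box`, the corner-offset encoding, and the selection rule are all illustrative assumptions; in practice the offsets and scores would come from the fusion network rather than being written by hand.

```python
def select_box(points, corner_offsets, scores):
    """Pick the highest-confidence box hypothesis (illustrative sketch).

    points:         list of N anchor points, each (x, y, z)
    corner_offsets: list of N hypotheses, each a list of 8 (dx, dy, dz)
                    offsets relative to the corresponding anchor point
                    (assumed encoding)
    scores:         list of N confidence values, one per hypothesis
    """
    # Index of the hypothesis with the highest confidence (assumed rule).
    best = max(range(len(scores)), key=lambda i: scores[i])
    ax, ay, az = points[best]
    # Convert anchor-relative offsets into absolute corner coordinates.
    corners = [(ax + dx, ay + dy, az + dz) for dx, dy, dz in corner_offsets[best]]
    return best, corners


# Toy inputs: two anchor points, each predicting a unit cube centred on itself.
unit_cube = [(dx, dy, dz)
             for dx in (-0.5, 0.5)
             for dy in (-0.5, 0.5)
             for dz in (-0.5, 0.5)]
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
offsets = [unit_cube, unit_cube]
scores = [0.2, 0.9]

best, corners = select_box(points, offsets, scores)
```

In this toy input the second point has the higher score, so its hypothesis is selected and its eight corners are expressed in the same coordinate frame as the point cloud.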