Abstract
Accurate detection of objects in 3D point clouds is a
central problem in many applications, such as autonomous
navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a
region proposal network (RPN), most existing efforts have
focused on hand-crafted feature representations, for example, a bird’s-eye-view projection. In this work, we remove
the need for manual feature engineering for 3D point clouds
and propose VoxelNet, a generic 3D detection network that
unifies feature extraction and bounding box prediction into
a single-stage, end-to-end trainable deep network. Specifically,
VoxelNet divides a point cloud into equally spaced 3D
voxels and transforms a group of points within each voxel
into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way,
the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate
detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art
LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative
representation of objects with various geometries, leading
to encouraging results in 3D detection of pedestrians and
cyclists, based only on LiDAR.
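
The voxel partitioning and per-voxel aggregation described above can be illustrated with a minimal NumPy sketch. This is not the VoxelNet implementation: the function names `voxelize` and `encode_voxels`, the voxel size, and the fixed `weight` matrix are assumptions made for illustration, and the single linear map with ReLU and element-wise max pooling is only a simplified stand-in for a learned VFE layer.

```python
import numpy as np


def voxelize(points, voxel_size, range_min):
    """Group points (N, 3) into equally spaced 3D voxels.

    Returns the integer coordinates of every non-empty voxel (V, 3) and,
    for each point, the index of the voxel it falls into (N,).
    """
    idx = np.floor((points - range_min) / voxel_size).astype(np.int64)
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    return coords, inverse.reshape(-1)


def encode_voxels(points, inverse, num_voxels, weight):
    """Hypothetical stand-in for a voxel feature encoding step.

    Applies one linear map plus ReLU to every point, then takes the
    element-wise maximum over the points of each voxel. In a trained
    network `weight` would be learned; here it is passed in as a constant.
    """
    pointwise = np.maximum(points @ weight, 0.0)            # (N, C)
    feats = np.full((num_voxels, weight.shape[1]), -np.inf)
    np.maximum.at(feats, inverse, pointwise)                # per-voxel max
    return feats


# Three points, two of which share a voxel (assumed 1 m cubic voxels).
points = np.array([[0.1, 0.1, 0.1],
                   [0.2, 0.3, 0.1],
                   [1.5, 0.2, 0.1]])
coords, inverse = voxelize(points, voxel_size=np.ones(3), range_min=np.zeros(3))
features = encode_voxels(points, inverse, len(coords), weight=np.eye(3))
```

Only non-empty voxels are materialized here, which keeps the representation compact when the point cloud is highly sparse. The resulting per-voxel features would then be arranged on the voxel grid before being passed to a detection head such as an RPN.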