Abstract. Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually
compromise spatial resolution to achieve real-time inference speed, which
leads to poor performance. In this paper, we address this dilemma with
a novel Bilateral Segmentation Network (BiSeNet). We first design a
Spatial Path with a small stride to preserve the spatial information and
generate high-resolution features. Meanwhile, a Context Path with a fast
downsampling strategy is employed to obtain sufficient receptive field.
On top of the two paths, we introduce a new Feature Fusion Module
to combine features efficiently. The proposed architecture makes a right
balance between the speed and segmentation performance on Cityscapes,
CamVid, and COCO-Stuff datasets. Specifically, for a 2048×1024 input,
we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed
of 105 FPS on one NVIDIA Titan XP card, which is significantly faster
than the existing methods with comparable performance