Locality-Sensitive Deconvolution Networks with Gated Fusion for RGB-D Indoor Semantic Segmentation
Abstract
This paper focuses on indoor semantic segmentation using RGB-D data. Although the commonly used deconvolution networks (DeconvNet) have achieved impressive results on this task, we find there is still room for improvement in two aspects. The first is boundary segmentation: DeconvNet aggregates large context to predict the label of each pixel, which inherently limits the segmentation precision of object boundaries. The second is RGB-D fusion: recent state-of-the-art methods generally fuse the RGB and depth networks with equal-weight score fusion, regardless of the varying contributions of the two modalities to delineating different categories in different scenes.
To address these two problems, we first propose a locality-sensitive DeconvNet (LS-DeconvNet) to refine the boundary segmentation over each modality. LS-DeconvNet incorporates local visual and geometric cues from the raw RGB-D data into each DeconvNet, enabling it to learn to upsample the coarse convolutional maps with large context while recovering sharp object boundaries. For RGB-D fusion, we introduce a gated fusion layer to effectively combine the two LS-DeconvNets. This layer learns to adjust the contributions of RGB and depth over each pixel for high-performance object recognition. Experiments on the large-scale SUN RGB-D dataset and the popular NYU-Depth v2 dataset show that our approach achieves new state-of-the-art results for RGB-D indoor semantic segmentation.
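To make the locality-sensitive upsampling idea concrete, here is a minimal PyTorch-style sketch of one such upsampling step: a learned deconvolution enlarges the coarse score map, and a per-pixel gate computed from the raw RGB-D input re-weights the result so that responses can follow object boundaries. The module name, layer sizes, and the single-channel sigmoid gate are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalitySensitiveUpsample(nn.Module):
    # Hypothetical sketch: upsample coarse class scores by 2x and gate
    # them with local visual + geometric cues from the raw RGB-D input.
    def __init__(self, num_classes, guidance_channels=4):  # RGB + depth
        super().__init__()
        # Learnable 2x upsampling of the coarse score map
        self.deconv = nn.ConvTranspose2d(num_classes, num_classes,
                                         kernel_size=4, stride=2, padding=1)
        # Per-pixel gate predicted from the RGB-D guidance
        self.gate = nn.Sequential(
            nn.Conv2d(guidance_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, coarse_scores, rgbd):
        up = self.deconv(coarse_scores)
        # Resize the raw RGB-D guidance to the upsampled resolution
        guide = F.interpolate(rgbd, size=up.shape[-2:], mode='bilinear',
                              align_corners=False)
        # Suppress responses that leak across boundaries in the guidance
        return up * self.gate(guide)
```

Stacking several such steps would recover full resolution while preserving the large-context predictions of the underlying DeconvNet.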
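Similarly, the gated fusion of the two modality streams can be sketched as a learned, per-pixel convex combination of the RGB and depth score maps, rather than the usual equal-weight average. Again, the gating network below is an assumed minimal design for illustration only.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Hypothetical sketch: predict a per-pixel (and per-class) weight
    # from both score maps and blend the two streams accordingly.
    def __init__(self, num_classes):
        super().__init__()
        self.weight = nn.Sequential(
            nn.Conv2d(2 * num_classes, num_classes, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_scores, depth_scores):
        # w in (0, 1): learned trust in the RGB stream at each pixel
        w = self.weight(torch.cat([rgb_scores, depth_scores], dim=1))
        return w * rgb_scores + (1.0 - w) * depth_scores

# Usage: fuse two 40-class score maps (e.g., the NYU-Depth v2 40-class task)
fuse = GatedFusion(num_classes=40)
fused = fuse(torch.randn(1, 40, 120, 160), torch.randn(1, 40, 120, 160))
```

Because the weights vary over pixels, such a fusion can lean on depth where geometry is discriminative (e.g., in poorly lit regions) and on RGB where texture and color dominate.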