Abstract
Objects may appear at arbitrary scales in perspective
images of a scene, posing a challenge for recognition systems that process images at a fixed resolution. We propose a depth-aware gating module that adaptively selects
the pooling field size in a convolutional network architecture according to the object scale (inversely proportional to
the depth) so that small details are preserved for distant objects while larger receptive fields are used for those nearby.
The depth gating signal is provided by stereo disparity or
estimated directly from monocular input. We integrate this
depth-aware gating into a recurrent convolutional neural
network to perform semantic segmentation. Our recurrent
module iteratively refines the segmentation results, leveraging the depth and semantic predictions from the previous
iterations.
Through extensive experiments on four popular large-scale datasets, we demonstrate that this approach achieves competitive semantic segmentation performance with a model
that is substantially more compact. We carry out extensive
analysis of this architecture including variants that operate
on monocular RGB but use depth as side-information during training, unsupervised gating as a generic attentional
mechanism, and multi-resolution gating. We find that gated
pooling for joint semantic segmentation and depth yields
state-of-the-art results for quantitative monocular depth estimation.
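
The depth-aware gating idea can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: it assumes a hard gate over equal-width depth bins and simple average pooling. The names `avg_pool_at`, `depth_gated_pool`, `radii`, and `max_depth` are hypothetical and chosen only for illustration. Each pixel's depth selects a pooling radius, so nearby pixels are pooled over a large window and distant pixels keep their fine detail.

```python
from typing import List

Grid = List[List[float]]


def avg_pool_at(feat: Grid, r: int, c: int, radius: int) -> float:
    """Mean of feat over a (2*radius+1)^2 window centred at (r, c), clipped at borders."""
    h, w = len(feat), len(feat[0])
    rows = range(max(0, r - radius), min(h, r + radius + 1))
    cols = range(max(0, c - radius), min(w, c + radius + 1))
    vals = [feat[i][j] for i in rows for j in cols]
    return sum(vals) / len(vals)


def depth_gated_pool(feat: Grid, depth: Grid, radii: List[int], max_depth: float) -> Grid:
    """Pool each pixel with a window radius chosen by its quantized depth.

    radii is ordered from the nearest depth bin to the farthest, so it should
    be decreasing: large windows for nearby pixels, small windows for far ones.
    Assumes non-negative depth values (illustrative assumption).
    """
    n_bins = len(radii)
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Hard gate: map depth to one of n_bins equal-width bins.
            b = min(int(depth[r][c] / max_depth * n_bins), n_bins - 1)
            out[r][c] = avg_pool_at(feat, r, c, radii[b])
    return out


if __name__ == "__main__":
    feat = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    near = [[1.0] * 3 for _ in range(3)]  # everything close to the camera
    far = [[9.0] * 3 for _ in range(3)]   # everything far from the camera
    print(depth_gated_pool(feat, near, radii=[2, 1, 0], max_depth=10.0))
    print(depth_gated_pool(feat, far, radii=[2, 1, 0], max_depth=10.0))
```

In this sketch the gate is a hard, per-pixel selection. A learned or soft gate, with the depth signal coming from stereo disparity or a monocular estimate, would replace the fixed binning step.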