StereoNet: Guided Hierarchical Refinement for
Real-Time Edge-Aware Depth Prediction
Abstract. This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVIDIA
Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a
sub-pixel matching precision that is an order of magnitude higher than that of
traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low resolution cost volume that encodes all the information needed to achieve high disparity precision. Spatial precision is achieved by employing a learned edge-aware upsampling
function. Our model uses a Siamese network to extract features from
the left and right image. A first estimate of the disparity is computed
in a very low resolution cost volume, then hierarchically the model reintroduces high-frequency details through a learned upsampling function
that uses compact pixel-to-pixel refinement networks. Leveraging color
input as a guide, this function is capable of producing high-quality edge-aware output. We achieve compelling results on multiple benchmarks,
showing how the proposed method offers extreme flexibility at an acceptable computational budget.
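
To make the trade-off between cost volume resolution and sub-pixel precision concrete, the following minimal sketch counts the cells of a cost volume before and after downsampling, then reads a fractional disparity from a single coarse cost vector. Everything in it is an illustrative assumption and is not taken from the abstract: the image size, the disparity range, the downsampling factor k = 3, the example costs, the helper names `cost_volume_cells` and `soft_argmin`, and the use of a softmax-weighted average (soft argmin) as the sub-pixel estimator.

```python
import math


def cost_volume_cells(height, width, max_disparity, k):
    """Cells in a cost volume built after k stride-2 downsamplings (assumed layout)."""
    scale = 2 ** k
    return (height // scale) * (width // scale) * (max_disparity // scale)


def soft_argmin(costs):
    """Sub-pixel disparity as the softmax-weighted mean of the candidate indices.

    Lower cost means a better match, so the weights are exp(-cost).
    """
    lowest = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


# Hypothetical input: a 720x1280 image with 192 candidate disparities.
k = 3
full = cost_volume_cells(720, 1280, 192, k=0)
coarse = cost_volume_cells(720, 1280, 192, k=k)
reduction = full // coarse  # each of the three axes shrinks by 2**k

# Hypothetical matching costs at one coarse pixel, symmetric about index 2.5.
costs = [9.0, 4.0, 1.0, 1.0, 4.0, 9.0]
coarse_disparity = soft_argmin(costs)

# A disparity measured on the coarse grid is rescaled to full-resolution pixels.
full_res_disparity = coarse_disparity * 2 ** k

print(reduction, coarse_disparity, full_res_disparity)
```

With these assumed numbers the coarse volume has 512 times fewer cells than the full-resolution one, and the fractional estimate falls between the two best integer candidates. Because the coarse disparity is multiplied by 2^k when mapped back to the input resolution, any error in it is magnified by the same factor, which is why high sub-pixel precision is needed for a low resolution cost volume to suffice.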