Abstract. We present an end-to-end deep learning architecture for depth
map inference from multi-view images. In the network, we first extract
deep visual image features, and then build the 3D cost volume upon
the reference camera frustum via differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial
depth map, which is then refined with the reference image to generate
the final output. Our framework flexibly adapts to arbitrary N-view inputs
using a variance-based cost metric that maps multiple features into one
cost feature. The proposed MVSNet is demonstrated on the large-scale
indoor DTU dataset. With simple post-processing, our method not only
significantly outperforms previous state-of-the-art methods, but is also several
times faster in runtime. We also evaluate MVSNet on the complex outdoor Tanks and Temples dataset, where our method ranked first as of
April 18, 2018 without any fine-tuning, showing the strong generalization
ability of MVSNet.
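As an illustrative sketch of the variance-based cost metric mentioned above (the symbols $\mathbf{V}_i$, $\overline{\mathbf{V}}$, and $\mathbf{C}$ are assumed notation for this sketch, not defined in the abstract): given $N$ warped feature volumes in the reference camera frustum, a single cost volume can be formed as their element-wise variance,
\[
  % Sketch: V_i is the i-th warped feature volume, \overline{V} their
  % element-wise average, and C the resulting single cost volume.
  \mathbf{C} \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl(\mathbf{V}_i - \overline{\mathbf{V}}\bigr)^{2},
  \qquad
  \overline{\mathbf{V}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathbf{V}_i .
\]
Because the variance is symmetric in its inputs and defined for any $N \ge 1$, such a metric maps an arbitrary number of view features into one cost feature of fixed size.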