Abstract. Signiicant progress has been made in monocular depth estimation with Convolutional Neural Networks (CNNs). While absolute
features, such as edges and textures, could be efectively extracted, the
depth constraint of neighboring pixels, namely relative features, has been
mostly ignored by recent CNN-based methods. To overcome this limitation, we explicitly model the relationships of diferent image locations
with an ainity layer and combine absolute and relative features in an
end-to-end network. In addition, we consider prior knowledge that major depth changes lie in the vertical direction, and thus, it is beneicial
to capture long-range vertical features for reined depth estimation. In
the proposed algorithm we introduce vertical pooling to aggregate image
features vertically to improve the depth accuracy. Furthermore, since the
Lidar depth ground truth is quite sparse, we enhance the depth labels
by generating high-quality dense depth maps with of-the-shelf stereo
matching method taking left-right image pairs as input. We also integrate multi-scale structure in our network to obtain global understanding of the image depth and exploit residual learning to help depth reinement. We demonstrate that the proposed algorithm performs favorably
against state-of-the-art methods both qualitatively and quantitatively
on the KITTI driving dataset.