Abstract. We present an unsupervised learning framework for simultaneously training single-view depth prediction and optical flow estimation models using unlabeled video sequences. Existing unsupervised methods often exploit brightness constancy and spatial smoothness priors to train depth or flow models. In this paper, we propose to leverage geometric consistency as additional supervisory signals. Our core idea is that for rigid regions we can use the predicted scene depth and camera motion to synthesize 2D optical flow by backprojecting the induced 3D scene flow. The discrepancy between the rigid flow (from depth prediction and camera motion) and the estimated flow (from optical flow model) allows us to impose a cross-task consistency loss. While all the networks are jointly optimized during training, they can be applied independently at test time. Extensive experiments demonstrate that our depth and flow models compare favorably with state-of-the-art unsupervised methods.
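To make the core idea concrete, the following is a minimal NumPy sketch of how a rigid flow field could be synthesized from a depth map and a relative camera motion, and how its discrepancy with an estimated flow could be scored. It is an illustration under stated assumptions, not the formulation used in this paper: the pinhole intrinsics `K`, the relative pose `(R, t)`, the L1 penalty, and the optional `rigid_mask` are all hypothetical choices introduced here.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Flow induced by camera motion (R, t) for a static scene with per-pixel depth.

    Assumes a pinhole camera with intrinsics K (3x3) and a pose that maps
    points from the first camera frame into the second.
    """
    h, w = depth.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    # Back-project every pixel to a 3D point in the first camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()
    # Move the points into the second camera frame and project them.
    proj = K @ (R @ pts + t.reshape(3, 1))
    uv2 = proj[:2] / proj[2:3]
    # Rigid flow is the displacement of each pixel between the two views.
    return (uv2 - pix[:2]).T.reshape(h, w, 2)

def cross_task_consistency(flow_rigid, flow_est, rigid_mask=None):
    """Mean L1 discrepancy between rigid and estimated flow.

    rigid_mask (hypothetical) is 1 where the scene is assumed rigid, else 0.
    """
    diff = np.abs(flow_rigid - flow_est).sum(axis=-1)
    if rigid_mask is None:
        return diff.mean()
    return (diff * rigid_mask).sum() / max(rigid_mask.sum(), 1)
```

In a training setting, `depth` and `(R, t)` would come from the depth and camera motion predictions and `flow_est` from the flow model, with the computation written in a differentiable framework so that the penalty can propagate gradients to the networks.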