Abstract
Deep networks have recently enjoyed enormous success when applied to recognition and classification problems in computer vision [22, 33], but their use in graphics problems has been limited ([23, 7] are notable recent exceptions). In this work, we present a novel deep architecture that performs new view synthesis directly from pixels, trained from a large number of posed image sets. In contrast to traditional approaches, which consist of multiple complex stages of processing, each of which requires careful tuning and can fail in unexpected ways, our system is trained end-to-end. The pixels from neighboring views of a scene are presented to the network, which then directly produces the pixels of the unseen view. The benefits of our approach include generality (we only require posed image sets and can easily apply our method to different domains) and high-quality results on traditionally difficult scenes. We believe this is due to the end-to-end nature of our system, which is able to plausibly generate pixels according to color, depth, and texture priors learnt automatically from the training data. We show view interpolation results on imagery from the KITTI dataset [12], on data from [1], as well as on Google Street View images. To our knowledge, our work is the first to apply deep learning to the problem of new view synthesis from sets of real-world, natural imagery.
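The core training setup described above — a model that maps pixels from neighboring posed views directly to the pixels of an unseen view, optimized end-to-end against photometric error — can be illustrated with a minimal toy sketch. The snippet below is purely hypothetical: it substitutes a linear regressor for the paper's deep architecture and synthetic patches for real posed image sets, so that only the end-to-end training loop itself is shown.

```python
# Hypothetical toy sketch of end-to-end view synthesis training:
# inputs are pixels from two neighboring views; the target is the
# unseen middle view. A linear model stands in for the deep network.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "dataset": each sample stacks two neighboring views' tiny
# pixel patches (flattened); the unseen middle view is, in this toy,
# simply the average of its two neighbors.
n_samples, patch = 256, 16
X = rng.standard_normal((n_samples, 2 * patch))          # neighbor views
true_W = 0.5 * np.eye(patch, 2 * patch) + 0.5 * np.eye(patch, 2 * patch, k=patch)
Y = X @ true_W.T                                          # target view patches

# End-to-end training: regress the target pixels directly from the
# input pixels by minimizing mean squared (photometric) error.
W = np.zeros((patch, 2 * patch))
lr = 0.05
for _ in range(500):
    pred = X @ W.T
    grad = (pred - Y).T @ X / n_samples   # gradient of the MSE loss w.r.t. W
    W -= lr * grad

final_mse = float(np.mean((X @ W.T - Y) ** 2))
print(f"final photometric MSE: {final_mse:.6f}")
```

Because the loss is photometric error on the output pixels, the model learns the neighbor-to-target mapping without any hand-tuned intermediate stages — the same property the abstract credits for the method's generality, here demonstrated on a trivially linear problem.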