Abstract
A multi-view image sequence provides a much richer capacity for object recognition than a single image. However, most existing solutions to multi-view recognition adopt hand-crafted, model-based geometric methods, which do not readily embrace recent trends in deep learning. We propose to bring Convolutional Neural Networks to generic multi-view recognition by decomposing an image sequence into a set of image pairs, classifying each pair independently, and then learning an object classifier by weighting the contribution of each pair. This allows for recognition over arbitrary camera trajectories, without requiring explicit training over the potentially infinite number of camera paths and lengths. Building these pairwise relationships then naturally extends to the next-best-view problem in an active recognition framework. To achieve this, we train a second Convolutional Neural Network to map directly from an observed image to the next viewpoint. Finally, we incorporate this into a trajectory optimisation task, whereby the best recognition confidence is sought for a given trajectory length. We present state-of-the-art results in both guided and unguided multi-view recognition on the ModelNet dataset, and show how our method can be used with depth images, greyscale images, or both.