Abstract
Spatial Transformer layers allow neural networks, at
least in principle, to be invariant to large spatial transformations in image data. The model has, however, seen
limited uptake, as most practical implementations support
only transformations that are either too restricted, e.g.,
affine or homographic maps, or destructive, such as thin
plate splines. We investigate the use of flexible diffeomorphic image transformations within such networks and
demonstrate that significant performance gains can be attained over currently used models. The learned transformations are found to be both simple and intuitive, thereby
providing insights into individual problem domains. With
the proposed framework, a standard convolutional neural
network matches state-of-the-art results on face verification
with only two extra lines of simple TensorFlow code.
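The core idea behind diffeomorphic transformations can be illustrated independently of any particular framework: a smooth velocity field is integrated (here by scaling and squaring) to yield a warp that is guaranteed to be invertible, unlike an unconstrained displacement. The sketch below is a minimal, hypothetical 1-D illustration of that general principle in NumPy; it is not the paper's TensorFlow implementation, and all function names are ours.

```python
import numpy as np

def integrate_velocity(v, steps=6):
    """Integrate a stationary velocity field v (per-pixel displacement)
    via scaling and squaring: start from a tiny, invertible step and
    repeatedly compose the warp with itself. Returns phi(x) - x."""
    n = len(v)
    x = np.arange(n, dtype=float)
    disp = v / (2 ** steps)          # small initial step keeps the warp invertible
    for _ in range(steps):
        # Compose the warp with itself: phi <- phi o phi
        disp = disp + np.interp(x + disp, x, disp)
    return disp

def warp_signal(signal, disp):
    """Resample a 1-D signal at the warped coordinates (linear interpolation)."""
    x = np.arange(len(signal), dtype=float)
    return np.interp(x + disp, x, signal)

# A smooth velocity field that vanishes at the borders, so the
# resulting warp fixes the image boundary.
n = 64
x = np.arange(n)
v = 3.0 * np.sin(np.pi * x / (n - 1))
disp = integrate_velocity(v)
warped = warp_signal(np.sin(2 * np.pi * x / n), disp)
```

Because the integrated map is monotone, `x + disp` never folds over itself; this is the "non-destructive" property the abstract contrasts with thin plate splines, which can collapse or tear regions of the image.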