Abstract
Research on learning suitable feature descriptors for
computer vision has recently shifted to deep learning, where
the biggest challenge lies in formulating appropriate loss functions, especially since the descriptors to be
learned are not known at training time. While approaches
such as Siamese and triplet losses have been applied with
success, it is still not well understood what makes a good
loss function. In this spirit, this work demonstrates that
many commonly used losses suffer from a range of problems. Based on this analysis, we introduce mixed-context
losses and scale-aware sampling, two methods that, when
combined, enable networks to learn consistently scaled descriptors for the first time.