Abstract
Convolutional neural networks (CNNs) have shown
great success in computer vision, approaching human-level
performance when trained for specific tasks via application-specific loss functions. In this paper, we propose a method
for augmenting and training CNNs so that their learned
features are compositional. Our approach encourages networks to
form representations that disentangle objects from their surroundings and from each other, thereby promoting better
generalization. Our method is agnostic to the details of the underlying CNN and can in
principle be applied to any architecture. As we show in our experiments, the learned representations yield feature activations that are more localized and improve performance over
non-compositional baselines in object recognition tasks.