Abstract
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014 without training on any bounding box annotation. We demonstrate in a variety of experiments that our network is able to localize the discriminative image regions despite just being trained for solving a classification task.
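The mechanism the abstract describes can be illustrated with a minimal NumPy sketch (not the paper's code; the array shapes, variable names, and random toy data are illustrative assumptions): global average pooling collapses each convolutional feature map to a single value, and re-weighting the same feature maps by one class's classifier weights yields a coarse localization heatmap.

```python
import numpy as np

def global_average_pool(features):
    """Collapse each feature map to its spatial mean: (C, H, W) -> (C,)."""
    return features.mean(axis=(1, 2))

def class_activation_map(features, weights, class_idx):
    """Weight the feature maps by one class's classifier weights and sum,
    producing a coarse (H, W) localization heatmap for that class."""
    # Contract the channel axis: (C,) . (C, H, W) -> (H, W)
    return np.tensordot(weights[class_idx], features, axes=1)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 7, 7))  # toy conv features: 4 maps, 7x7
weights = rng.standard_normal((10, 4))     # toy classifier: 10 classes, 4 channels

pooled = global_average_pool(features)                      # shape (4,)
cam = class_activation_map(features, weights, class_idx=3)  # shape (7, 7)
```

Because pooling and the linear classifier both commute with the spatial average, the class score computed from the pooled features equals the spatial mean of the heatmap, which is why the classifier's weights double as localization weights.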