Abstract. For many computer vision applications, such as image description and human identification, recognizing the visual attributes of
humans is an essential yet challenging problem. Its challenges originate
from its multi-label nature, the large underlying class imbalance, and the
lack of spatial annotations. Existing methods follow either a computer
vision approach while failing to account for class imbalance, or explore
machine learning solutions, which disregard the spatial and semantic
relations that exist in the images. With that in mind, we propose an
effective method that extracts and aggregates visual attention masks at
different scales. We introduce a loss function that handles class imbalance
at both the class and the instance level, and we further demonstrate that penalizing attention masks with high prediction variance accounts for the weak
supervision of the attention mechanism. By identifying and addressing
these challenges, we achieve state-of-the-art results with a simple attention mechanism on both the PETA and WIDER-Attribute datasets without
additional context or side information.