Abstract
When humans describe images, they tend to use combinations of nouns and adjectives, corresponding to objects and their associated attributes respectively. To generate such a description automatically, one needs to model objects, attributes and their associations. Conventional methods require strong annotation of object and attribute locations, making them less scalable. In this paper, we model object-attribute associations from weakly labelled images, such as those widely available on media sharing sites (e.g. Flickr), where only image-level labels (either objects or attributes) are given, without their locations and associations. This is achieved by introducing a novel weakly supervised non-parametric Bayesian model. Once learned, given a new image, our model can describe the image, including objects, attributes and their associations, as well as their locations and segmentation. Extensive experiments on benchmark datasets demonstrate that our weakly supervised model performs on par with strongly supervised models on tasks such as image description and retrieval based on object-attribute associations.