Abstract
In this work, we propose a novel approach that predicts the relationships between entities in an image in a weakly supervised manner, relying on image captions and object bounding box annotations as the sole sources of supervision. Our approach uses a top-down attention mechanism to align entities in captions to objects in the image, and then leverages the syntactic structure of the captions to align the relations. We use these alignments to train a relation classification network, thereby obtaining both grounded captions and dense relationships. We demonstrate the effectiveness of our model on the Visual Genome dataset, achieving a recall@50 of 15% and a recall@100 of 25% on the relationships present in the image. We also show that the model successfully predicts relations that are not present in the corresponding captions.