Abstract
We investigate the problem of producing structured
graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures
in scene graphs. We present new quantitative insights on
such repeated structures in the Visual Genome dataset. Our
analysis shows that object labels are highly predictive of
relation labels but not vice-versa. We also find that there
are recurring patterns even in larger subgraphs: more than
50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between
object pairs with the given labels, as seen in the training
set. This baseline improves on the previous state of the art
by an average of 3.6% in relative terms across evaluation settings. We then introduce Stacked Motif Networks, a
new architecture designed to capture higher-order motifs in
scene graphs, which further improves over our strong baseline
by an average of 7.1% in relative terms. Our code is available at
github.com/rowanz/neural-motifs
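
The frequency baseline described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the code from the repository: the function names, the (subject, relation, object) triple format, and the toy training data are all hypothetical. It counts how often each relation occurs for each ordered pair of object labels in the training set, then predicts the most frequent one for a detected pair.

```python
from collections import Counter, defaultdict


def fit_frequency_baseline(training_triples):
    """Build a lookup table: (subject label, object label) -> most frequent relation.

    `training_triples` is assumed to be an iterable of
    (subject_label, relation_label, object_label) tuples.
    """
    counts = defaultdict(Counter)
    for subj, rel, obj in training_triples:
        counts[(subj, obj)][rel] += 1
    # Keep only the single most frequent relation for each label pair.
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}


def predict_relation(table, subj, obj, default=None):
    """Return the most frequent training relation for the pair, or `default` if unseen."""
    return table.get((subj, obj), default)


# Toy training data (hypothetical, for illustration only).
train = [
    ("man", "wearing", "shirt"),
    ("man", "wearing", "shirt"),
    ("man", "holding", "shirt"),
    ("dog", "on", "grass"),
]
table = fit_frequency_baseline(train)
```

In this sketch the object labels are taken as given, which corresponds to the "given object detections" condition in the abstract. Pairs never seen in training fall back to a caller-supplied default.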