Abstract
We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images. We define the concreteness of constituents by their matching scores with images, and use it to guide the parsing of text. Experiments on the MSCOCO data set show that VG-NSL outperforms various unsupervised parsing approaches that do not use visual grounding, in terms of F1 scores against gold parse trees. We find that VG-NSL is much more stable with respect to the choice of random initialization and the amount of training data. We also find that the concreteness acquired by VG-NSL correlates well with a similar measure defined by linguists. Finally, we apply VG-NSL to multiple languages in the Multi30K data set, showing that our model consistently outperforms prior unsupervised approaches.
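As a rough illustration of scoring constituents by visual grounding (a minimal sketch, not the paper's exact formulation), the code below computes a constituent's concreteness as how much better its embedding matches the paired image than randomly sampled unpaired images; the function names (`matching_score`, `concreteness`), tensor shapes, and margin value are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def matching_score(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a constituent embedding and an image embedding."""
    return F.cosine_similarity(text_emb, image_emb, dim=-1)

def concreteness(constituent_emb: torch.Tensor,
                 paired_image_emb: torch.Tensor,
                 unpaired_image_embs: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    """Hinge-style matching reward: how much better the constituent matches its own
    image than K unpaired images. Higher values suggest a more 'concrete' span.

    Assumed shapes: constituent_emb (d,), paired_image_emb (d,),
    unpaired_image_embs (K, d).
    """
    pos = matching_score(constituent_emb, paired_image_emb)                  # scalar
    neg = matching_score(constituent_emb.unsqueeze(0), unpaired_image_embs)  # (K,)
    return torch.clamp(pos - neg + margin, min=0.0).mean()
```

In VG-NSL, image-text matching scores of this kind are used to reward the parser's bracketing decisions; the sketch covers only the scoring step, not the parser or its training procedure.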