Abstract
BERT is a recent language representation
model that has performed surprisingly well in
diverse language understanding benchmarks.
This result suggests that BERT
networks capture structural information about
language. In this work, we provide novel support for this claim by performing a series of
experiments to unpack the elements of English
language structure learned by BERT. We first
show that BERT’s phrasal representation captures phrase-level information in the lower layers. We also show that BERT’s intermediate
layers encode a rich hierarchy of linguistic information, with surface features at the bottom,
syntactic features in the middle and semantic
features at the top. BERT turns out to require
deeper layers when long-distance dependency
information is needed, e.g. to track subject-verb agreement. Finally, we show that BERT
representations capture linguistic information
in a compositional way that mimics classical,
tree-like structures.