Abstract
Pre-trained text encoders have rapidly advanced the state of the art on many NLP tasks. We focus on one such model, BERT, and aim to quantify where linguistic information is captured within the network. We find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. Qualitative analysis reveals that the model can and often does adjust this pipeline dynamically, revising lower-level decisions on the basis of disambiguating information from higher-level representations.