Our object of study is the BERT model introduced in [6]. To set context and terminology, we briefly describe the model's architecture. The input to BERT is based on a sequence of tokens (words or pieces of words). The output is a sequence of vectors, one for each input token. We will often refer to these vectors as context embeddings because they include information about a token's context.

BERT's internals consist of two parts. First, an initial embedding for each token is created by combining a pre-trained wordpiece embedding with position and segment information. Next, this initial sequence of embeddings is run through multiple transformer layers, producing a new sequence of context embeddings at each step. (BERT comes in two versions, a 12-layer BERT-base model and a 24-layer BERT-large model.) Implicit in each transformer layer is a set of attention matrices, one for each attention head, each of which contains a scalar value for each ordered pair (token_i, token_j).
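As a concrete illustration of these quantities, the sketch below shows one way to inspect the per-layer context embeddings and per-head attention matrices of a pretrained BERT-base model. It is a minimal sketch, not part of the setup in [6]; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.

```python
# Minimal sketch: extract per-layer context embeddings and attention matrices
# from BERT-base. Assumes the Hugging Face `transformers` library and the
# `bert-base-uncased` checkpoint (not specified in the original text).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,   # context embeddings after every layer
    output_attentions=True,      # attention matrices for every head
)
model.eval()

inputs = tokenizer("The chicken didn't cross the road", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of 13 tensors for BERT-base (initial embedding plus one
# per transformer layer), each of shape (batch, sequence_length, 768).
hidden_states = outputs.hidden_states

# attentions: tuple of 12 tensors, one per layer, each of shape
# (batch, num_heads, sequence_length, sequence_length); entry [i, j] is the
# scalar attention weight for the ordered pair (token_i, token_j).
attentions = outputs.attentions
```

For BERT-large the same call yields 24 attention tensors and 25 hidden-state tensors, with hidden size 1024.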