Abstract
We investigate how annotators’ insensitivity to differences in dialect can lead to racial bias in automatic hate speech detection models, potentially amplifying harm against minority populations. We first uncover unexpected correlations between surface markers of African American English (AAE) and ratings of toxicity in several widely-used hate speech datasets. Then, we show that models trained on these corpora acquire and propagate these biases, such that AAE tweets and tweets by self-identified African Americans are up to two times more likely to be labelled as offensive compared to others. Finally, we propose dialect and race priming as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect, they are significantly less likely to label the tweet as offensive.