Abstract. Most machine learning methods are known to capture and exploit biases of the training data. While some biases are beneficial for learning, others are
harmful. Specifically, image captioning models tend to exaggerate biases present
in training data (e.g., if a word is present in 60% of training sentences, it might
be predicted in 70% of sentences at test time). Because a model can over-rely on the
learned prior and on image context, this can lead to incorrect captions in domains
where unbiased captions are desired or required. In this work we investigate the generation of
gender-specific caption words (e.g., man, woman) based on the person's appearance or the image context. We introduce a new Equalizer model that encourages
equal gender probability when gender evidence is occluded in a scene and confident
predictions when gender evidence is present. The resulting model is forced
to look at a person rather than use contextual cues to make a gender-specific prediction. The losses that comprise our model, the Appearance Confusion Loss and
the Confident Loss, are general and can be added to any description model to
mitigate the impact of unwanted bias in a description dataset. Our proposed
model has lower error than prior work when describing images with people and
mentioning their gender, and it more closely matches the ground-truth ratio of sentences including women to sentences including men. Finally, we show that our
model more often looks at people when predicting their gender.
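
As a rough illustration only (the precise formulations appear in the body of the paper, and all notation here is assumed for exposition: I the image, I' the image with the person region occluded, \mathcal{G}_w and \mathcal{G}_m the sets of woman- and man-words, w_{0:t-1} the caption prefix), the two losses could take the following shape:

L^{AC} = \left| \sum_{g \in \mathcal{G}_w} p(g \mid w_{0:t-1}, I') - \sum_{g \in \mathcal{G}_m} p(g \mid w_{0:t-1}, I') \right|,
\qquad
L^{Con} = \frac{\sum_{g \in \bar{\mathcal{G}}} p(g \mid w_{0:t-1}, I)}{\sum_{g \in \mathcal{G}^{\star}} p(g \mid w_{0:t-1}, I) + \epsilon},

where \mathcal{G}^{\star} contains the ground-truth gender word and \bar{\mathcal{G}} the opposite-gender words. Driving L^{AC} to zero equalizes the two genders when the person is occluded, while driving L^{Con} to zero rewards confident, correct gender words when the person is visible.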