Abstract. We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval
from voice to face and from face to voice.
We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any
identity labels, using a form of cross-modal self-supervision; second, we
develop a curriculum learning schedule for hard negative mining, tailored
to this task, which is essential for learning to proceed successfully; third,
we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of the
joint embedding to automatically retrieving and labelling characters
in TV dramas.
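The hard-negative curriculum mentioned above can be illustrated with a minimal NumPy sketch: a margin-based cross-modal contrastive loss in which a curriculum parameter controls what fraction of the hardest (closest) in-batch negatives contribute to the loss. This is an illustrative sketch only, not the paper's exact loss or schedule; the function name, the margin value, and the `hard_frac` parameter are assumptions.

```python
import numpy as np

def contrastive_loss_with_curriculum(face, voice, margin=0.6, hard_frac=0.5):
    """Cross-modal contrastive loss with curriculum-controlled hard negatives.

    face, voice: (N, D) L2-normalised embeddings; row i of each modality
    comes from the same talking-face track (the positive pair).
    hard_frac: fraction of the hardest negatives per anchor to include;
    a curriculum would raise this gradually as training progresses.
    (Illustrative sketch only; not the authors' exact formulation.)
    """
    n = face.shape[0]
    # Pairwise Euclidean distances between every face and voice embedding.
    d = np.linalg.norm(face[:, None, :] - voice[None, :, :], axis=2)
    pos = np.diag(d)                        # distances of matching pairs
    neg = d + np.eye(n) * 1e9               # mask out the positive pairs
    # Ascending sort puts the hardest (closest) negatives first; the
    # masked diagonal entry lands in the last column and is dropped.
    neg_sorted = np.sort(neg, axis=1)[:, :-1]
    k = max(1, int(round(hard_frac * (n - 1))))
    hardest = neg_sorted[:, :k]             # k hardest negatives per anchor
    # Pull positives together, push selected negatives beyond the margin.
    loss = pos[:, None] ** 2 + np.maximum(0.0, margin - hardest) ** 2
    return float(loss.mean())
```

Starting with a small `hard_frac` (mostly easy negatives) and increasing it over epochs mirrors the easy-to-hard schedule the abstract describes as essential for stable self-supervised training.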