Abstract
Word embeddings typically represent different meanings of a word in a single conflated vector. Empirical analysis of embeddings of ambiguous words is currently limited by the small size of manually annotated resources and by the fact that word senses are treated as unrelated individual concepts. We present a large dataset based on manual Wikipedia annotations and word senses, in which word senses from different words are related by semantic classes. This is the basis for novel diagnostic tests of an embedding's content: we probe word embeddings for semantic classes and analyze the embedding space by classifying embeddings into semantic classes. Our main findings are: (i) information about a sense is generally well represented in a single-vector embedding, provided the sense is frequent; (ii) a classifier can accurately predict whether a word is single-sense or multi-sense based only on its embedding; (iii) although rare senses are not well represented in single-vector embeddings, this has no negative impact on an NLP application whose performance depends on frequent senses.
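To make the probing setup concrete, here is a minimal, self-contained sketch of classifying embeddings into semantic classes. The vectors, class names, and the nearest-centroid probe are all illustrative stand-ins chosen for brevity; the paper's actual data and classifier may differ.

```python
# Illustrative probe: assign toy "embeddings" to semantic classes with a
# nearest-centroid classifier. All vectors and class labels here are
# synthetic stand-ins, not the paper's data.

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def probe(train, query):
    """Return the class whose training centroid is nearest to `query`
    (squared Euclidean distance). `train` maps class -> list of vectors."""
    centroids = {c: centroid(vs) for c, vs in train.items()}
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist(centroids[c], query))

# Hypothetical 2-d embeddings, clustered by semantic class.
train = {
    "food":         [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]],
    "organization": [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]],
}
print(probe(train, [0.85, 0.15]))  # a vector near the "food" cluster -> "food"
```

If the probe recovers the semantic class from the embedding alone, the class information is linearly (or here, geometrically) accessible in the vector; failure to recover it for rare senses is what finding (iii) refers to.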