Abstract
Pre-trained distributed word representations have
proven useful in a variety of natural language
processing (NLP) tasks. However, the effect of
words' geometric structure on word representations
has not yet been carefully studied. Existing word
representation methods underestimate the similarity
of word pairs that lie close together in Euclidean
space, while overestimating the similarity of pairs
separated by greater distances. In this paper, we
propose a word vector refinement model that corrects
pre-trained word embeddings, using manifold learning
to bring the similarity of words in Euclidean space
closer to their semantics. The approach is theoretically
grounded in the metric recovery paradigm.
We evaluate our word representations on a variety
of lexical-level intrinsic tasks (semantic relatedness
and semantic similarity), and the experimental results
show that the proposed model outperforms several
popular word representation approaches.
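
As a rough illustration of the refinement idea only (not the paper's actual model), the following Python sketch re-embeds pre-trained vectors with Isomap, a standard manifold learning method that recovers a metric embedding from geodesic distances on a k-nearest-neighbor graph; the `vectors` array and all parameter values are illustrative assumptions.

```python
# Hypothetical sketch: manifold-based refinement of word embeddings.
# Isomap builds a k-NN graph, computes shortest-path (geodesic)
# distances on it, and embeds those distances via classical MDS,
# so nearby points keep their local Euclidean structure while
# far-apart points are placed by graph distance instead.
import numpy as np
from sklearn.manifold import Isomap

# Stand-in for pre-trained embeddings (e.g., GloVe/word2vec);
# random data is used here only to keep the sketch self-contained.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 300))  # (n_words, dim) -- assumed shape

# Illustrative hyperparameters; the paper's settings are not shown here.
refiner = Isomap(n_neighbors=10, n_components=50)
refined = refiner.fit_transform(vectors)

def cosine(u, v):
    """Cosine similarity in the refined space, as used for
    semantic similarity/relatedness evaluation."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(refined[0], refined[1]))
```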