Abstract
Vector representations of sentences, trained on massive text corpora, are widely used as generic sentence embeddings across a variety of NLP problems. The learned representations are generally assumed to be continuous and real-valued, giving rise to a large memory footprint and slow retrieval speed, which hinders their applicability to low-resource (memory and computation) platforms such as mobile devices. In this paper, we propose four different strategies to transform continuous and generic sentence embeddings into a binarized form while preserving their rich semantic information. The introduced methods are evaluated across a wide range of downstream tasks, where the binarized sentence embeddings are demonstrated to degrade performance by only about 2% relative to their continuous counterparts, while reducing the storage requirement by over 98%. Moreover, with the learned binary representations, the semantic relatedness of two sentences can be evaluated by simply computing their Hamming distance, which is more computationally efficient than the inner product operation between continuous embeddings. Detailed analysis and a case study further validate the effectiveness of the proposed methods.
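The efficiency claim above rests on the fact that the Hamming distance between two packed binary codes reduces to an XOR followed by a population count. The following minimal sketch illustrates this with NumPy; the 4096-bit dimension and the byte-packing scheme are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Two hypothetical binary sentence embeddings (4096 bits each),
# drawn at random purely for illustration.
rng = np.random.default_rng(0)
a_bits = rng.integers(0, 2, size=4096, dtype=np.uint8)
b_bits = rng.integers(0, 2, size=4096, dtype=np.uint8)

# Pack 4096 bits into 512 bytes: this is the compact storage format
# that yields the memory savings discussed above.
a_packed = np.packbits(a_bits)
b_packed = np.packbits(b_bits)

def hamming_distance(x: np.ndarray, y: np.ndarray) -> int:
    """Count differing bits between two packed binary codes (XOR + popcount)."""
    return int(np.unpackbits(np.bitwise_xor(x, y)).sum())

d = hamming_distance(a_packed, b_packed)
# Sanity check: must agree with a direct bit-level comparison.
assert d == int((a_bits != b_bits).sum())
```

Compared with a float inner product over the same dimensionality, the XOR/popcount path operates on 8x fewer memory words and uses only integer bit operations, which is what makes large-scale similarity search over binary codes fast.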