Swivel in TensorFlow

This is a TensorFlow implementation of the Swivel algorithm for generating word embeddings.

Swivel works as follows:

Compute the co-occurrence statistics from a corpus; that is, determine how often a word c appears in the context (e.g., "within ten words") of a focus word f. This results in a sparse co-occurrence matrix whose rows represent the focus words, and whose columns represent the context words. Each cell value is the number of times the focus and context words were observed together.
Re-organize the co-occurrence matrix and chop it into smaller pieces.
Assign a random embedding vector of fixed dimension (say, 300) to each focus word and to each context word.
Iteratively attempt to approximate the pointwise mutual information (PMI) between words with the dot product of the corresponding embedding vectors.
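As a rough illustration of the first and last steps, the sketch below counts co-occurrences in a toy corpus and computes the PMI values that the embedding dot products are meant to approximate. It assumes plain, unweighted counts and a symmetric window; the function names are made up for this example and are not taken from prep.py or swivel.py.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each context word falls within `window` tokens of each focus word."""
    counts = Counter()
    for i, focus in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(focus, tokens[j])] += 1
    return counts

def pmi(counts):
    """Pointwise mutual information for every observed (focus, context) pair."""
    total = sum(counts.values())
    row_sums = Counter()  # total count per focus word
    col_sums = Counter()  # total count per context word
    for (f, c), n in counts.items():
        row_sums[f] += n
        col_sums[c] += n
    return {
        (f, c): math.log(n * total / (row_sums[f] * col_sums[c]))
        for (f, c), n in counts.items()
    }

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
scores = pmi(counts)
```

Pairs that never co-occur are simply absent from `scores`, which is why the real matrix is so sparse.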

Note that the resulting co-occurrence matrix is very sparse (i.e., contains many zeros) since most words won't have been observed in the context of other words. In the case of very rare words, it seems reasonable to assume that you just haven't sampled enough data to spot their co-occurrence yet. On the other hand, if we've failed to observe two common words co-occurring, it seems likely that they are anti-correlated.

Swivel attempts to capture this intuition by using both the observed and the un-observed co-occurrences to inform the way it iteratively adjusts vectors. Empirically, this seems to lead to better embeddings, especially for rare words.
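One way to picture this is a per-cell loss with two cases. The sketch below is an assumption-laden illustration, not the code in swivel.py: observed cells get a weighted squared error against the PMI, while unobserved cells are treated as if the pair had been seen once and are only penalized when the dot product exceeds that optimistic estimate. The square-root weighting and the function names are chosen for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cell_loss(row_vec, col_vec, count, row_sum, col_sum, total):
    """Loss for one cell of the co-occurrence matrix (illustrative only)."""
    prediction = dot(row_vec, col_vec)
    if count > 0:
        # Observed pair: weighted squared error against the actual PMI.
        target = math.log(count * total / (row_sum * col_sum))
        weight = math.sqrt(count)  # assumed confidence weighting
        return 0.5 * weight * (prediction - target) ** 2
    # Unobserved pair: pretend it was seen once, and penalize only predictions
    # that exceed that optimistic PMI estimate (a soft hinge).
    target = math.log(total / (row_sum * col_sum))
    return math.log1p(math.exp(prediction - target))

# Same prediction (dot product 0) for an unobserved pair of common words
# and an unobserved pair of rare words.
common = cell_loss([1.0, 0.0], [0.0, 1.0], 0, 1000, 1000, 10000)
rare = cell_loss([1.0, 0.0], [0.0, 1.0], 0, 2, 2, 10000)
```

With these toy numbers the unobserved pair of common words incurs a much larger loss than the unobserved pair of rare words, so the common pair is pushed apart while the rare pair is left mostly alone.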

This release includes the following programs.

prep.py is a program that takes a text corpus and pre-processes it for training. Specifically, it computes a vocabulary and token co-occurrence statistics for the corpus. It then outputs the information into a format that can be digested by the TensorFlow trainer.
swivel.py is a TensorFlow program that generates embeddings from the co-occurrence statistics. It uses the files created by prep.py as input, and generates two text files as output: the row and column embeddings.
distributed.sh is a Bash script that is meant to act as a template for launching "distributed" Swivel training; i.e., multiple processes that work in parallel and communicate via a parameter server.
text2bin.py combines the row and column vectors generated by Swivel into a flat binary file that can be quickly loaded into memory to perform vector arithmetic. This can also be used to convert embeddings from GloVe and word2vec into a form that can be used by the following tools.
nearest.py is a program that you can use to manually inspect binary embeddings.
eval.mk is a GNU makefile that will retrieve and normalize several common word similarity and analogy evaluation data sets.
wordsim.py performs word similarity evaluation of the resulting vectors.
analogy performs analogy evaluation of the resulting vectors.
fastprep is a C++ program that works much more quickly than prep.py, but also has some additional dependencies to build.

Building Embeddings with Swivel

To build your own word embeddings with Swivel, you'll need the following:

A large corpus of text; for example, the dump of English Wikipedia.
A working TensorFlow implementation.
A machine with plenty of disk space and, ideally, a beefy GPU card. (We've experimented with the Nvidia Titan X, for example.)

You'll then run prep.py (or fastprep) to prepare the data for Swivel and run swivel.py to create the embeddings. The resulting embeddings will be output into two large text files: one for the row vectors and one for the column vectors. You can use those "as is", or convert them into a binary file using text2bin.py and then use the tools here to experiment with the resulting vectors.

Preparing the data for training

Once you've downloaded the corpus (e.g., to /tmp/wiki.txt), run prep.py to prepare the data for training:

./prep.py --output_dir /tmp/swivel_data --input /tmp/wiki.txt

By default, prep.py will make one pass through the text file to compute a "vocabulary" of the most frequent words, and then a second pass to compute the co-occurrence statistics. Command-line options allow you to control this behavior.

The prep.py program is pretty simple. Notably, it does almost no text processing: it does no case translation and simply breaks text into tokens by splitting on spaces. Feel free to experiment with the words function if you'd like to do something more sophisticated.
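For instance, a slightly more sophisticated tokenizer might lower-case the text and separate punctuation from words. The sketch below is a hypothetical replacement: the name words comes from the paragraph above, but the signature (one line of text in, a list of tokens out) is an assumption about how prep.py calls it.

```python
import re

# Hypothetical tokenizer; the signature (line in, list of tokens out) is
# assumed, not copied from prep.py.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def words(line):
    """Lower-case a line and split it into word and punctuation tokens."""
    return _TOKEN_RE.findall(line.lower())

tokens = words("Hello, World!")
```

Lower-casing at this stage also keeps the vocabulary consistent with lower-cased evaluation data.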

Unfortunately, prep.py is pretty slow. Also included is fastprep, a C++ equivalent that works much more quickly. Building fastprep.cc is a bit more involved: it requires you to pull and build the TensorFlow source code in order to provide the libraries and headers that it needs. See fastprep.mk for more details.

Training the embeddings

When prep.py completes, it will have produced a directory containing the data that the Swivel trainer needs to run. Train embeddings as follows:

./swivel.py --input_base_path /tmp/swivel_data \
   --output_base_path /tmp/swivel_data

There are a variety of parameters that you can fiddle with to customize the embeddings.

As mentioned above, access to a beefy GPU will dramatically reduce the amount of time it takes Swivel to train embeddings.

When complete, you should find row_embedding.tsv and col_embedding.tsv in the directory specified by --output_base_path. These files are tab-delimited files that contain one embedding per line. Each line contains the token followed by dim floating point numbers.
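If you want to use the text files "as is", the format described above is easy to parse. Here is a minimal sketch that reads such lines into a dictionary; the function name is made up for this example, and the sample rows are invented rather than real Swivel output.

```python
def read_embeddings(lines):
    """Parse tab-delimited lines of the form: token<TAB>x1<TAB>x2<TAB>..."""
    embeddings = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        embeddings[fields[0]] = [float(x) for x in fields[1:]]
    return embeddings

# Invented sample rows with dim = 3.
sample = ["dog\t0.1\t-0.2\t0.3\n", "cat\t0.0\t0.5\t-1.0\n"]
vectors = read_embeddings(sample)
```

For a real file, pass an open file handle instead of the sample list, since iterating over a file yields one line at a time.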

Exploring and evaluating the embeddings

There are also some simple tools you can use to explore the embeddings. These tools work with a simple binary vector format that can be mmap-ed into memory along with a separate vocabulary file. Use text2bin.py to generate these files:

./text2bin.py -o vecs.bin -v vocab.txt /tmp/swivel_data/*_embedding.tsv

You can do some simple exploration using nearest.py:

./nearest.py -v vocab.txt -e vecs.bin
query> dog
dog
dogs
cat
...
query> man woman king
king
queen
princess
...

To evaluate the embeddings using common word similarity and analogy datasets, use eval.mk to retrieve the data sets and build the tools:

make -f eval.mk
./wordsim.py -v vocab.txt -e vecs.bin *.ws.tab
./analogy --vocab vocab.txt --embeddings vecs.bin *.an.tab

The word similarity evaluation compares the embeddings' estimate of "similarity" with human judgement using Spearman's rho as the measure of correlation. (Bigger numbers are better.)
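To make the metric concrete, here is a small self-contained sketch of Spearman's rho: rank both lists (averaging the ranks of ties), then take the Pearson correlation of the ranks. This is an illustration, not the code in wordsim.py, and the ratings below are made up.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

human = [9.0, 7.5, 3.0, 1.0]  # made-up human similarity ratings
model = [0.8, 0.9, 0.2, 0.1]  # made-up embedding similarities
rho = spearman_rho(human, model)
```

Because only the rank order matters, the human ratings and the embedding similarities do not need to be on the same scale. The sketch does not handle the degenerate case where one list is constant.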

The analogy evaluation tests how well the embeddings can predict analogies like "man is to woman as king is to queen".
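A common way to make such a prediction is vector arithmetic followed by a cosine nearest-neighbor search. The sketch below uses hand-built toy vectors and a made-up function name; it is not the implementation of the analogy tool. It also excludes the three query words from the candidates, which is an assumed convention (the nearest.py session shown earlier lists king itself first).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumed convention: the query words themselves are not valid answers.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors built by hand so that the "gender" offset is the second axis.
toy = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.2, 0.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
```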

Note that eval.mk forces all evaluation data into lower case. From there, both the word similarity and analogy evaluations assume that the eval data and the embeddings use consistent capitalization: if you train embeddings using mixed case and evaluate them using lower case, things won't work well.