资源算法InferSent

InferSent

2019-09-11 | |  154 |   0 |   0

InferSent

InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

We provide our pre-trained English sentence encoder our paper and our SentEval evaluation toolkit.

Recent changes: Added infersent2 model trained on fastText vectors and added max-pool option.

Dependencies

This code is written in python. Dependencies include:

  • Python 2/3

  • Pytorch (recent version)

  • NLTK >= 3

Download datasets

To get SNLI and MultiNLI, run (in dataset/):

./get_data.bash

This will download and preprocess SNLI/MultiNLI datasets. For MacOS, you may have to use p7zip instead of unzip.

Download GloVe (V1) or fastText (V2) vectors:

mkdir dataset/GloVe
curl -Lo dataset/GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip dataset/GloVe/glove.840B.300d.zip -d dataset/GloVe/
mkdir dataset/fastText
curl -Lo dataset/fastText/crawl-300d-2M.vec.zip https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip
unzip dataset/fastText/crawl-300d-2M.vec.zip -d dataset/fastText/

Use our sentence encoder

We provide a simple interface to encode English sentences. See encoder/demo.ipynb

for a practical example. Get started with the following steps:

0.0) Download our InferSent models (V1 trained with GloVe, V2 trained with fastText)[147MB]:

curl -Lo encoder/infersent1.pkl https://s3.amazonaws.com/senteval/infersent/infersent1.pkl
curl -Lo encoder/infersent2.pkl https://s3.amazonaws.com/senteval/infersent/infersent2.pkl

Note that infersent1 is trained with GloVe (which have been trained on text preprocessed with the PTB tokenizer) and infersent2 is trained with fastText (which have been trained on text preprocessed with the MOSES tokenizer). The latter also removes the padding of zeros with max-pooling which was inconvenient when embedding sentences outside of their batches.

0.1) Make sure you have the NLTK tokenizer by running the following once:

import nltknltk.download('punkt')

1) Load our pre-trained model (in encoder/):

from models import InferSentV = 2MODEL_PATH = 'encoder/infersent%s.pkl' % Vparams_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}infersent = InferSent(params_model)infersent.load_state_dict(torch.load(MODEL_PATH))

2) Set word vector path for the model:

W2V_PATH = 'fastText/crawl-300d-2M.vec'infersent.set_w2v_path(W2V_PATH)

3) Build the vocabulary of word vectors (i.e keep only those needed):

infersent.build_vocab(sentences, tokenize=True)

where sentences is your list of n sentences. You can update your vocabulary using infersent.update_vocab(sentences), or directly load the K most common English words with infersent.build_vocab_k_words(K=100000). If tokenize is True (by default), sentences will be tokenized using NTLK.

4) Encode your sentences (list of *n sentences):*

embeddings = infersent.encode(sentences, tokenize=True)

This outputs a numpy array with n vectors of dimension 4096. Speed is around 1000 sentences per second with batch size 128 on a single GPU.

5) Visualize the importance that our model attributes to each word:

We provide a function to visualize the importance of each word in the encoding of a sentence:

infersent.visualize('A man plays an instrument.', tokenize=True)

Model

Train model on Natural Language Inference (SNLI)

To reproduce our results on SNLI, run:

python train_nli.py --word_emb_path '<path to word embeddings>'

You should obtain a dev accuracy of 85 and a test accuracy of 84.5 with the default setting.

Evaluate the encoder on transfer tasks

To evaluate the model on transfer tasks, see SentEval. Be mindful to choose the same tokenization used for training the encoder. You should obtain the following test results for the baselines and the InferSent models:

Model | MR | CR | SUBJ | MPQA | STS14 | STS Benchmark | SICK Relatedness | SICK Entailment | SST | TREC | MRPC :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: InferSent1 | 81.1 | 86.3 | 92.4 | 90.2 | .68/.65 | 75.8/75.5 | 0.884 | 86.1 | 84.6 | 88.2 | 76.2/83.1 InferSent2 | 79.7 | 84.2 | 92.7 | 89.4 | .68/.66 | 78.4/78.4 | 0.888 | 86.3 | 84.3 | 90.8 | 76.0/83.8 SkipThought | 79.4 | 83.1 | 93.7 | 89.3 | .44/.45 | 72.1/70.2| 0.858 | 79.5 | 82.9 | 88.4 | - fastText-BoV | 78.2 | 80.2 | 91.8 | 88.0 | .65/.63 | 70.2/68.3 | 0.823 | 78.9 | 82.3 | 83.4 | 74.4/82.4

Reference

Please consider citing [1] if you found this code useful.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (EMNLP 2017)

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

@InProceedings{conneau-EtAl:2017:EMNLP2017,
  author    = {Conneau, Alexis  and  Kiela, Douwe  and  Schwenk, Holger  and  Barrault, Lo"{i}c  and  Bordes, Antoine},
  title     = {Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {670--680},
  url       = {https://www.aclweb.org/anthology/D17-1070}
}

Related work


上一篇:Training RNNs as Fast as CNNs

下一篇:pytorch-playground

用户评价
全部评价

热门资源

  • Keras-ResNeXt

    Keras ResNeXt Implementation of ResNeXt models...

  • seetafaceJNI

    项目介绍 基于中科院seetaface2进行封装的JAVA...

  • spark-corenlp

    This package wraps Stanford CoreNLP annotators ...

  • capsnet-with-caps...

    CapsNet with capsule-wise convolution Project ...

  • inferno-boilerplate

    This is a very basic boilerplate example for pe...