
LexNET for Noun Compound Relation Classification

This is a TensorFlow implementation of the LexNET algorithm for relation classification, applied here to classifying the relationships that hold between the constituents of noun compounds:

  • olive oil is oil that is made from olives

  • cooking oil is oil that is used for cooking

  • motor oil is oil that is contained in a motor

The model is a supervised classifier that predicts the relationship that holds between the constituents of a two-word noun compound using:

  1. A neural "paraphrase" of each syntactic dependency path that connects the constituents in a large corpus. For example, given a sentence like This fine oil is made from first-press olives, the dependency path is something like oil <NSUBJPASS made PREP> from POBJ> olive.

  2. The distributional information provided by the individual words; i.e., the word embeddings of the two constituents.

  3. The distributional signal provided by the compound itself; i.e., the embedding of the noun compound in context.

The model includes several variants: the path-based model uses (1) alone, the distributional model uses (2) alone, and the integrated model uses (1) and (2). The distributional-nc model and the integrated-nc model each add (3).
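To make the variants concrete, here is a minimal Python sketch, not the repository's actual code, of how each variant assembles its input features. The variant names match the --input flag used below; the dimensions and the final softmax classifier are assumptions for illustration.

import numpy as np

# Minimal sketch (NOT the repository's actual code) of how each variant
# assembles its input features; names and dimensions are illustrative.
EMBED_DIM = 300  # e.g., GloVe dimensionality (assumption)

def build_features(model, path_vec, w1_vec, w2_vec, nc_vec):
    # path_vec: neural paraphrase of the dependency paths, signal (1)
    # w1_vec, w2_vec: embeddings of the two constituents, signal (2)
    # nc_vec: embedding of the compound itself in context, signal (3)
    parts = {
        'path': [path_vec],
        'dist': [w1_vec, w2_vec],
        'dist-nc': [w1_vec, w2_vec, nc_vec],
        'integrated': [path_vec, w1_vec, w2_vec],
        'integrated-nc': [path_vec, w1_vec, w2_vec, nc_vec],
    }[model]
    # The concatenated vector is what a softmax classifier over the
    # relation inventory would consume.
    return np.concatenate(parts)

x = build_features('integrated-nc',
                   np.zeros(60),         # path representation (size illustrative)
                   np.zeros(EMBED_DIM),  # e.g., 'olive'
                   np.zeros(EMBED_DIM),  # e.g., 'oil'
                   np.zeros(EMBED_DIM))  # e.g., 'olive oil' in context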

Training a model requires the following:

  1. A collection of noun compounds that have been labeled using a relation inventory. The inventory describes the specific relationships that you'd like the model to differentiate (e.g., part of versus composed of versus purpose) and may consist of tens of classes. You can download the dataset used in the paper from here.

  2. A collection of word embeddings: the path-based model uses the word embeddings as part of the path representation, and the distributional models use the word embeddings directly as prediction features.

  3. A collection of syntactic dependency parses containing the paths that connect the constituents of each noun compound (required by the path-based model).

At the moment, this repository does not contain the tools for generating this data, but we will provide references to existing datasets and plan to add tools to generate the data in the future.
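Purely as a hypothetical illustration of requirement (1), a labeled collection pairs each compound with a relation from the inventory. The relation labels below are invented for the example, and the actual on-disk format used by the scripts is not documented here.

# Hypothetical illustration only: the real file format and relation
# inventory are defined by the dataset, not by this sketch.
labeled_compounds = [
    ('olive', 'oil', 'MADE-OF'),    # olive oil: oil made from olives
    ('cooking', 'oil', 'PURPOSE'),  # cooking oil: oil used for cooking
    ('motor', 'oil', 'LOCATION'),   # motor oil: oil contained in a motor
]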

Contents

The following source code is included here:

  • learn_path_embeddings.py is a script that trains and evaluates a path-based model to predict a noun-compound relationship given labeled noun-compounds and dependency parse paths.

  • learn_classifier.py is a script that trains and evaluates a classifier based on any combination of paths, word embeddings, and noun-compound embeddings.

  • get_indicative_paths.py is a script that generates the most indicative syntactic dependency paths for a particular relationship.

Dependencies

  • TensorFlow: see the detailed installation instructions on the TensorFlow site.

  • scikit-learn: you can probably just install this with pip install scikit-learn.

Creating the Model

This section describes the necessary steps that you must follow to reproduce the results in the paper.

Generate/Download Path Data

TBD! Our plan is to make available the aggregate path data that was used to train the path embeddings and classifiers; however, this will be released separately.

Generate/Download Embedding Data

TBD! While we used the standard GloVe vectors for the relata embeddings, the NC embeddings were generated separately. Our plan is to make that data available, but it will be released separately.

Create Path Embeddings

Create the path embeddings using learn_path_embeddings.py. This shell script fragment will iterate through each dataset, split, and corpus to generate path embeddings for each.

for DATASET in tratz/fine_grained tratz/coarse_grained ; do
  for SPLIT in random lexical_head lexical_mod lexical_full ; do
    for CORPUS in wiki_gigiawords ; do
      python learn_path_embeddings.py \
        --dataset_dir ~/lexnet/datasets \
        --dataset "${DATASET}" \
        --corpus "${SPLIT}/${CORPUS}" \
        --embeddings_base_path ~/lexnet/embeddings \
        --logdir /tmp/learn_path_embeddings
    done
  done
done

The path embeddings will be placed in the directory specified by --embeddings_base_path.

Train classifiers

Train classifiers and evaluate them on the validation and test data using the learn_classifier.py script. This shell script fragment will iterate through each dataset, split, corpus, and model type to train and evaluate classifiers.

LOGDIR=/tmp/learn_classifier
mkdir -p "${LOGDIR}"  # ensure the log directory exists before redirecting output
for DATASET in tratz/fine_grained tratz/coarse_grained ; do
  for SPLIT in random lexical_head lexical_mod lexical_full ; do
    for CORPUS in wiki_gigiawords ; do
      for MODEL in dist dist-nc path integrated integrated-nc ; do
        # Filename for the log that will contain the classifier results.
        LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
        python learn_classifier.py \
          --dataset_dir ~/lexnet/datasets \
          --dataset "${DATASET}" \
          --corpus "${SPLIT}/${CORPUS}" \
          --embeddings_base_path ~/lexnet/embeddings \
          --logdir "${LOGDIR}" \
          --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
      done
    done
  done
done

The log file will contain the final performance (precision, recall, F1) on the train, dev, and test sets, and will include a confusion matrix for each.
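For reference, metrics like these can be computed with scikit-learn (already a dependency). The following is a minimal sketch with made-up relation labels, not the script's actual evaluation code; weighted averaging is one common choice and the script's choice may differ.

from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up gold and predicted relation labels, for illustration only.
gold = ['MADE-OF', 'PURPOSE', 'LOCATION', 'PURPOSE']
pred = ['MADE-OF', 'PURPOSE', 'LOCATION', 'MADE-OF']

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average='weighted')
print('P=%.3f R=%.3f F1=%.3f' % (precision, recall, f1))
print(confusion_matrix(gold, pred, labels=sorted(set(gold))))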

Contact

If you have any questions, issues, or suggestions, feel free to contact either @vered1986 or @waterson.

If you use this code for any published research, please include the following citation:

Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model. Vered Shwartz and Chris Waterson. NAACL 2018.

Link: https://github.com/tensorflow/models/tree/master/research/lexnet_nc
