deep-siamese-text-similarity
This project is a prototype for experimental purposes only; production-grade code is not released here.
It is a TensorFlow-based implementation of a deep siamese LSTM network that captures phrase/sentence similarity using character embeddings.
This code provides an architecture for learning two kinds of tasks:
Phrase similarity using char level embeddings [1]
Sentence similarity using word level embeddings
For both of the tasks mentioned above, it uses a multilayer siamese LSTM network and a Euclidean distance based contrastive loss to learn input pair similarity.
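To make the loss concrete, here is a minimal sketch of a Euclidean distance based contrastive loss in plain Python. It follows the common textbook formulation and is not taken from this repository's code; the function names, the `margin` value of 1.0, and the toy vectors are assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(distances, labels, margin=1.0):
    """Mean contrastive loss over a batch of pair distances.

    labels: 1 for a similar pair, 0 for a dissimilar pair.
    margin: assumed hyperparameter; dissimilar pairs farther apart
            than the margin contribute no loss.
    """
    total = 0.0
    for d, y in zip(distances, labels):
        similar_term = y * d ** 2                              # pull similar pairs together
        dissimilar_term = (1 - y) * max(margin - d, 0.0) ** 2  # push dissimilar pairs apart
        total += similar_term + dissimilar_term
    return total / (2 * len(distances))


# Toy encoder outputs for one similar and one dissimilar pair (made-up values).
pairs = [([0.1, 0.2], [0.1, 0.25]), ([0.9, 0.1], [0.0, 0.8])]
labels = [1, 0]
distances = [euclidean_distance(a, b) for a, b in pairs]
loss = contrastive_loss(distances, labels)
print(loss)
```

In a siamese setup both inputs of a pair pass through the same network weights, so the distance is computed between two outputs of one shared encoder.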
Given adequate training pairs, this model can learn semantic as well as structural similarity. For example:
Phrases:
International Business Machines = I.B.M
Synergy Telecom = SynTel
Beam inc = Beam Incorporate
Sir J J Smith = Johnson Smith
Alex, Julia = J Alex
James B. D. Joshi = James Joshi
James Beaty, Jr. = Beaty
For phrases, the model learns character based embeddings to identify structural/syntactic similarities.
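As an illustration of the input side of character based embeddings, the sketch below turns phrases into fixed-length sequences of character ids, which an embedding layer could then map to vectors. The helper names `build_char_vocab` and `encode`, the lowercasing step, and the use of id 0 for padding are assumptions for this example and may differ from what the project actually does.

```python
def build_char_vocab(texts):
    """Map every character seen in texts to a positive id; 0 is reserved for padding."""
    chars = sorted({ch for text in texts for ch in text.lower()})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len):
    """Convert text to a list of exactly max_len ids.

    Longer inputs are truncated, shorter ones are right-padded with 0.
    Characters missing from the vocabulary also map to 0 in this sketch.
    """
    ids = [vocab.get(ch, 0) for ch in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


phrases = ["Beam inc", "Beam Incorporate"]
vocab = build_char_vocab(phrases)
encoded = [encode(p, vocab, max_len=20) for p in phrases]

for phrase, ids in zip(phrases, encoded):
    print(phrase, ids)
```

The two phrases share the same first eight ids, which is the kind of structural overlap a character-level model can pick up on.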
Sentences:
He is smart = He is a wise man.
Someone is travelling countryside = He is travelling to a village.
She is cooking a dessert = Pudding is being cooked.
Microsoft to acquire Linkedin ≠ Linkedin to acquire microsoft
(For more examples, refer to the SemEval dataset.)
For Sentences, the model uses pre-trained word embeddings to identify semantic similarities.
Categories of pairs it can learn as similar:
Annotations
Abbreviations
Extra words
Similar semantics
Typos
Compositions
Summaries
Phrases:
A sample set of person name paraphrases for training is attached to this repository. To generate the full person name disambiguation data, follow the steps described at:
https://github.com/dhwajraj/dataset-person-name-disambiguation
"person_match.train" : https://drive.google.com/open?id=1HnMv7ulfh8yuq9yIrt_IComGEpDrNyo-
Sentences:
A sample training set for sentence semantic similarity can be downloaded from:
"train_snli.txt" : https://drive.google.com/open?id=1itu7IreU_SyUSdmTWydniGxW-JEGTjrv
This data is generated using the SNLI project.
Word embeddings: any set of pre-trained word embeddings can be used in this project. For our testing we used the fastText Simple English embeddings from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
An alternate download location for "wiki.simple.vec" is: https://drive.google.com/open?id=1u79f3d2PkmePzyKgubkbxOjeaZCJgCrt
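For readers unfamiliar with the text format used by files such as "wiki.simple.vec", the sketch below parses it with the standard library only. The format is a header line with the vocabulary size and dimensionality, followed by one word and its vector per line. The function name `load_text_vectors` and the in-memory sample are assumptions for illustration; the project itself may load embeddings differently (for example through gensim).

```python
import io


def load_text_vectors(handle):
    """Parse word2vec-style text: '<count> <dim>' header, then '<word> v1 ... v_dim' per line."""
    _count, dim = (int(n) for n in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors, dim


# Tiny made-up sample standing in for a real embeddings file.
sample = io.StringIO("2 3\nsmart 0.1 0.2 0.3\nwise 0.1 0.25 0.3\n")
vectors, dim = load_text_vectors(sample)
print(dim, sorted(vectors))
```

With a real file, the same function could be called on `open("wiki.simple.vec", encoding="utf-8")`.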
numpy 1.11.0
tensorflow 1.2.1
gensim 1.0.1
nltk 3.2.2
$ python train.py [options/defaults]

options:
  -h, --help            show this help message and exit
  --is_char_based IS_CHAR_BASED
                        is character based syntactic similarity to be used for phrases.
                        if false then word embedding based semantic similarity is used.
                        (default: True)
  --word2vec_model WORD2VEC_MODEL
                        this flag will be used only if IS_CHAR_BASED is False
                        word2vec pre-trained embeddings file (default: wiki.simple.vec)
  --word2vec_format WORD2VEC_FORMAT
                        this flag will be used only if IS_CHAR_BASED is False
                        word2vec pre-trained embeddings file format (bin/text/textgz)
                        (default: text)
  --embedding_dim EMBEDDING_DIM
                        Dimensionality of character embedding (default: 100)
  --dropout_keep_prob DROPOUT_KEEP_PROB
                        Dropout keep probability (default: 0.5)
  --l2_reg_lambda L2_REG_LAMBDA
                        L2 regularizaion lambda (default: 0.0)
  --max_document_words MAX_DOCUMENT_WORDS
                        Max length (left to right max words to consider) in every doc,
                        else pad 0 (default: 100)
  --training_files TRAINING_FILES
                        Comma-separated list of training files (each file is tab
                        separated format) (default: None)
  --hidden_units HIDDEN_UNITS
                        Number of hidden units(default:50)
  --batch_size BATCH_SIZE
                        Batch Size (default: 128)
  --num_epochs NUM_EPOCHS
                        Number of training epochs (default: 200)
  --evaluate_every EVALUATE_EVERY
                        Evaluate model on dev set after this many steps (default: 2000)
  --checkpoint_every CHECKPOINT_EVERY
                        Save model after this many steps (default: 2000)
  --allow_soft_placement [ALLOW_SOFT_PLACEMENT]
                        Allow device soft device placement
  --noallow_soft_placement
  --log_device_placement [LOG_DEVICE_PLACEMENT]
                        Log placement of ops on devices
  --nolog_device_placement
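The help text only states that each training file is tab separated. As a rough sketch of reading such a file, the example below assumes three columns per line (first text, second text, integer label with 1 for similar and 0 for dissimilar). That column layout, the helper name `read_pairs`, and the sample rows are assumptions for illustration and should be checked against the actual data files.

```python
import csv
import io


def read_pairs(handle):
    """Yield (text_a, text_b, label) tuples from tab-separated lines.

    Assumed layout: text_a <TAB> text_b <TAB> label. Rows with fewer
    than three fields are skipped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue
        yield row[0].strip(), row[1].strip(), int(row[2])


# Made-up sample rows; the last line is deliberately malformed.
sample = io.StringIO(
    "Synergy Telecom\tSynTel\t1\n"
    "Beam inc\tJames Joshi\t0\n"
    "broken line\n"
)
pairs = list(read_pairs(sample))
print(pairs)
```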
$ python eval.py --model graph#.pb
Phrases:
Training time (8-core CPU): 1 complete epoch = 6 min 48 secs (training requires at least 30 epochs)
Contrastive loss: 0.0248
Evaluation performance: similarity measure for 100,000 pairs (8-core CPU) = 1 min 40 secs
Accuracy: 91%
Sentences:
Training time (8-core CPU): 1 complete epoch = 8 min 10 secs (training requires at least 50 epochs)
Contrastive loss: 0.0477
Evaluation performance: similarity measure for 100,000 pairs (8-core CPU) = 2 min 10 secs
Accuracy: 81%