I have used a BERT token classification model to extract keywords from a sentence. Feel free to clone and use it. If you face any problems, kindly post them in the issues section.
Special credit to the BERT authors (Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova) for the original repo, and to Hugging Face for the PyTorch version.
The keyword-extractor.py script can be used to extract keywords from a sentence and accepts the following arguments:
optional arguments:
  -h, --help      show this help message and exit
  --sentence SEN  sentence to extract keywords
  --path LOAD     path to load model from
Example:
python keyword-extractor.py --sentence "BERT is a great model." --path "model.pt"
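To give a feel for how token-level predictions can become keywords, here is a minimal sketch. It assumes a B/I/O label scheme ("B" starts a keyphrase, "I" continues it, "O" is outside), which may differ from the labels this repo actually uses. The function name `group_keywords` and the example labels are made up for illustration, and merging of WordPiece sub-tokens is left out.

```python
from typing import List


def group_keywords(tokens: List[str], labels: List[str]) -> List[str]:
    """Merge consecutive tagged tokens into keyphrases.

    Assumption: labels follow a B/I/O scheme, which is hypothetical here.
    """
    phrases: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new phrase starts; close the previous one if it is open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            # Continue the phrase that is currently open.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open phrase) closes any open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tokens = ["BERT", "is", "a", "great", "model", "."]
labels = ["B", "O", "O", "O", "O", "O"]  # made-up predictions, not real model output
print(group_keywords(tokens, labels))
```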
Training
You can also train the model yourself, starting from BERT's pre-trained model. The main.py script handles training and accepts the following arguments:
optional arguments:
  -h, --help      show this help message and exit
  --data DATA     location of the data corpus
  --lr LR         initial learning rate
  --epochs EPOCHS upper epoch limit
  --batch_size N  batch size
  --seq_len N     sequence length
  --save SAVE     path to save the final model
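For reference, the arguments above could be declared with `argparse` roughly as follows. This is only a sketch that mirrors the help text: the argument types, the parser description, and the absence of defaults are assumptions, and the values passed in the example (data path, learning rate, epochs, sequence length) are placeholders, not recommended settings.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Types below are assumed; the help strings are taken from the usage text.
    parser = argparse.ArgumentParser(description="Train the keyword extractor")
    parser.add_argument("--data", type=str, help="location of the data corpus")
    parser.add_argument("--lr", type=float, help="initial learning rate")
    parser.add_argument("--epochs", type=int, help="upper epoch limit")
    parser.add_argument("--batch_size", type=int, metavar="N", help="batch size")
    parser.add_argument("--seq_len", type=int, metavar="N", help="sequence length")
    parser.add_argument("--save", type=str, help="path to save the final model")
    return parser


# Placeholder values for illustration only.
args = build_parser().parse_args(
    ["--data", "data/", "--lr", "2e-5", "--epochs", "3",
     "--batch_size", "32", "--seq_len", "128", "--save", "model.pt"]
)
print(args.lr, args.epochs, args.batch_size, args.seq_len, args.save)
```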
This model was trained on the SemEval 2010 dataset (scientific publications). You can swap it for your own custom dataset.
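If you bring your own dataset, you will need one label per token. The sketch below shows one possible way to derive B/I/O labels from a tokenized sentence and a list of gold keyphrases. The label scheme, the function name `bio_labels`, and the case-insensitive exact matching are all assumptions for illustration; check the data loading code in this repo for the format it actually expects.

```python
from typing import List


def bio_labels(tokens: List[str], keyphrases: List[List[str]]) -> List[str]:
    """Tag every occurrence of each keyphrase with B/I, everything else with O."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for phrase in keyphrases:
        target = [w.lower() for w in phrase]
        n = len(target)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_free = all(lab == "O" for lab in labels[start:start + n])
            # Only tag spans that match exactly and are not already labeled.
            if lowered[start:start + n] == target and span_free:
                labels[start] = "B"
                for i in range(start + 1, start + n):
                    labels[i] = "I"
    return labels


tokens = ["We", "study", "keyphrase", "extraction", "with", "BERT"]
keyphrases = [["keyphrase", "extraction"], ["BERT"]]  # made-up example annotations
print(bio_labels(tokens, keyphrases))
```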
Code explanations
I have provided an explanation of keyphrase extraction in the form of a Python notebook.
Hyper-parameter Tuning
I ran ablation experiments following the BERT paper, and these are the results. I suggest using the parameters in line 4. All training was done with a batch size of 32.
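If you want to run a similar sweep yourself, the sketch below builds a grid from the fine-tuning ranges suggested in the BERT paper (learning rates 5e-5, 3e-5, 2e-5 and 2, 3, or 4 epochs), with the batch size fixed at 32 as stated above. Whether the experiments reported here covered exactly these values is an assumption, and the order of the generated configurations does not correspond to the lines of the results.

```python
from itertools import product

# Ranges suggested in the BERT paper for fine-tuning; assumed, not confirmed,
# to match the values used in the experiments reported here.
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]
batch_size = 32  # fixed, as stated for all training runs

grid = [
    {"lr": lr, "epochs": epochs, "batch_size": batch_size}
    for lr, epochs in product(learning_rates, epoch_counts)
]

for cfg in grid:
    # Each configuration maps onto the --lr, --epochs and --batch_size arguments of main.py.
    print(cfg)
```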