
BERT-chinese-text-classification-pytorch


BERT Chinese text classification with PyTorch

This repo contains a PyTorch implementation of a pretrained BERT model for Chinese text classification.

Structure of the code

At the root of the project, you will see:

```
├── pybert
|  └── callback
|  |  └── lrscheduler.py
|  |  └── trainingmonitor.py
|  |  └── ...
|  └── config
|  |  └── base.py # configuration file for storing model parameters
|  └── dataset
|  └── io
|  |  └── bert_processor.py
|  └── model
|  |  └── nn
|  |  └── pretrain
|  └── output # saves the output of the model
|  └── preprocessing # text preprocessing
|  └── train # used for training a model
|  |  └── trainer.py
|  |  └── ...
|  └── utils # a set of utility functions
├── run_bert.py
```
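
For orientation, the kind of settings stored in pybert/config/base.py looks roughly like the sketch below. All names, paths, and values here are hypothetical; the actual file in the repo defines its own.

```python
# Hypothetical sketch of the kind of settings kept in pybert/config/base.py;
# the repo's actual file uses its own names, paths, and values.
BASE_DIR = 'pybert'

config = {
    'raw_data_path': f'{BASE_DIR}/dataset/raw/train.csv',        # raw training data
    'test_data_path': f'{BASE_DIR}/dataset/raw/test.csv',        # raw test data
    'output_dir': f'{BASE_DIR}/output',                          # checkpoints, logs, figures
    'bert_model_dir': f'{BASE_DIR}/pretrain/bert/base-uncased',  # pytorch_model.bin / config.json / vocab.txt

    # fine-tuning hyperparameters (values in the range recommended by the BERT paper)
    'max_seq_len': 256,
    'train_batch_size': 32,
    'learning_rate': 2e-5,
    'epochs': 3,
}
```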

Dependencies

  • csv

  • tqdm

  • numpy

  • pickle

  • scikit-learn

  • PyTorch 1.0

  • matplotlib

  • pytorch_transformers==1.1.0

How to use the code

You first need to download the pretrained Chinese BERT model:

  1. Download the BERT pretrained model from s3

  2. Download the BERT config file from s3

  3. Download the BERT vocab file from s3

  4. Rename bert-base-chinese-pytorch_model.bin to pytorch_model.bin, bert-base-chinese-config.json to config.json, and bert-base-chinese-vocab.txt to vocab.txt.

  5. Place the model, config, and vocab files into the pybert/pretrain/bert/base-uncased directory (a quick loading check is sketched after this list).

  6. Install pytorch-transformers: pip install pytorch-transformers (or install it from GitHub).

  7. Prepare the dataset from the Baidu Netdisk share (password: ruxu); you can modify pybert/io/bert_processor.py to adapt it to your own data.

  8. Modify the configuration in pybert/config/base.py (data paths, etc.).

  9. Run python run_bert.py --do_data to preprocess the data.

  10. Run python run_bert.py --do_train --save_best to fine-tune the BERT model.

  11. Run python run_bert.py --do_test --do_lower_case to predict on new data.
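
To confirm that steps 1-6 worked, the following sanity check (a minimal sketch, not part of the repo; it assumes pytorch_transformers 1.1.0 and the directory layout from step 5) loads the pretrained Chinese BERT and runs one forward pass:

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

# Directory from step 5, containing pytorch_model.bin, config.json, and vocab.txt
bert_dir = 'pybert/pretrain/bert/base-uncased'

tokenizer = BertTokenizer.from_pretrained(bert_dir)
model = BertModel.from_pretrained(bert_dir)
model.eval()

# Tokenize a short Chinese sentence into WordPieces and run a forward pass
tokens = ['[CLS]'] + tokenizer.tokenize('今天股市大涨') + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
with torch.no_grad():
    sequence_output, pooled_output = model(input_ids)
print(sequence_output.shape)  # (1, seq_len, 768) for bert-base-chinese
```

If the shapes print without warnings about missing weights, the renamed files are being picked up correctly.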

Fine-tuning result

Training

Epoch: 3 - loss: 0.0222 - acc: 0.9939 - f1: 0.9911 - val_loss: 0.0785 - val_acc: 0.9799 - val_f1: 0.9800

Classification report

| label | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 财经 | 0.97 | 0.96 | 0.96 | 1500 |
| 体育 | 1.00 | 1.00 | 1.00 | 1500 |
| 娱乐 | 0.99 | 0.99 | 0.99 | 1500 |
| 家居 | 0.99 | 0.99 | 0.99 | 1500 |
| 房产 | 0.96 | 0.97 | 0.96 | 1500 |
| 教育 | 0.98 | 0.97 | 0.97 | 1500 |
| 时尚 | 0.99 | 0.98 | 0.99 | 1500 |
| 时政 | 0.97 | 0.98 | 0.98 | 1500 |
| 游戏 | 1.00 | 0.99 | 0.99 | 1500 |
| 科技 | 0.96 | 0.97 | 0.97 | 1500 |
| avg / total | 0.98 | 0.98 | 0.98 | 15000 |
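
Tables like the one above are the standard per-class output of scikit-learn's classification_report (scikit-learn is already listed as a dependency). A minimal sketch with hypothetical labels and predictions:

```python
from sklearn.metrics import classification_report

# Hypothetical gold and predicted label ids for a 3-class toy example; in the repo
# the predictions would come from the fine-tuned model on the 15000-sample test set.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
target_names = ['财经', '体育', '娱乐']

print(classification_report(y_true, y_pred, target_names=target_names, digits=2))
```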

Training figure

(training figure image not reproduced here)

Tips

  • When converting the TensorFlow checkpoint to PyTorch, choose "bert_model.ckpt" (not "bert_model.ckpt.index") as the input file. Otherwise the model will learn nothing and give nearly identical, essentially random outputs for any input, which means the true checkpoint was never loaded.

  • When using multiple GPUs, non-tensor calculations such as accuracy and f1_score are not supported by a DataParallel instance.

  • As recommended by Jacob Devlin et al. in the BERT paper (https://arxiv.org/pdf/1810.04805.pdf), the fine-tuning hyperparameters are expected to be set as follows: batch_size: 16 or 32; learning_rate: 5e-5, 3e-5, or 2e-5; num_train_epochs: 3 or 4.

  • The pretrained model limits the input sequence to at most 512 tokens, the maximum position embedding size. Data flows into the model as: raw text -> WordPieces -> model. Since the WordPiece sequence is generally longer than the raw text, a safe maximum length for the raw text is roughly 128-256 (see the sketch after these tips).

  • Upon testing, we found that fine-tuning all layers gives much better results than fine-tuning only the last classifier layer; the latter is essentially a feature-based approach.
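
For the sequence-length tip above, a short sketch (assuming the tokenizer from the pretrained directory set up earlier) compares the raw-text length with the WordPiece length and truncates before adding the special tokens:

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pybert/pretrain/bert/base-uncased')

text = 'BERT在中文文本分类任务上表现优异。' * 40   # some longer raw text
pieces = tokenizer.tokenize(text)                # WordPiece tokens
print(len(text), len(pieces))                    # raw character count vs. WordPiece count

# 512 is the hard position-embedding limit; reserve two slots for [CLS] and [SEP]
max_seq_len = 512
pieces = pieces[:max_seq_len - 2]
tokens = ['[CLS]'] + pieces + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(len(input_ids))                            # <= 512
```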

