BERT-chinese-text-classification-pytorch

Structure of the code

At the root of the project, you will see:

How to use the code

You need to download the pretrained Chinese BERT model:

Download the BERT pretrained model from S3

Download the BERT config file from S3

Download the BERT vocab file from S3

Rename bert-base-chinese-pytorch_model.bin to pytorch_model.bin, bert-base-chinese-config.json to config.json, and bert-base-chinese-vocab.txt to vocab.txt.

Place the model, config, and vocab files in the /pybert/pretrain/bert/base-uncased directory.
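The rename-and-place steps above can be scripted. The following is a minimal sketch, not part of the project: the function name install_pretrained and the download_dir argument are hypothetical, and the target directory is assumed to be relative to the project root.

```python
import shutil
from pathlib import Path

# Downloaded file name -> name expected in the pretrain directory
# (mapping taken from the rename step above).
RENAMES = {
    "bert-base-chinese-pytorch_model.bin": "pytorch_model.bin",
    "bert-base-chinese-config.json": "config.json",
    "bert-base-chinese-vocab.txt": "vocab.txt",
}


def install_pretrained(download_dir, target_dir):
    """Copy the three downloaded files into target_dir under their new names.

    install_pretrained is a hypothetical helper, not a function of this repo.
    """
    download_dir, target_dir = Path(download_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for src_name, dst_name in RENAMES.items():
        src = download_dir / src_name
        if not src.exists():
            raise FileNotFoundError(f"missing download: {src}")
        shutil.copyfile(src, target_dir / dst_name)
        installed.append(dst_name)
    return installed


# Example call (paths are assumptions; adjust to your checkout):
# install_pretrained("downloads", "pybert/pretrain/bert/base-uncased")
```

Copying rather than moving keeps the original downloads intact, so the step can be repeated if the target directory is cleaned.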

Install pytorch-transformers from GitHub with pip.

Prepare the data (BaiduNet, password: ruxu). You can modify io.bert_processor.py to adapt it to your own data.

Modify the configuration in pybert/config/base.py (the data paths, ...).

Run python run_bert.py --do_data to preprocess the data.

Run python run_bert.py --do_train --save_best to fine-tune the BERT model.

Run python run_bert.py --do_test --do_lower_case to predict on new data.

Fine-tuning result

training

Epoch: 3 - loss: 0.0222 acc: 0.9939 - f1: 0.9911 val_loss: 0.0785 - val_acc: 0.9799 - val_f1: 0.9800

classify_report

label	precision	recall	f1-score	support
财经	0.97	0.96	0.96	1500
体育	1.00	1.00	1.00	1500
娱乐	0.99	0.99	0.99	1500
家居	0.99	0.99	0.99	1500
房产	0.96	0.97	0.96	1500
教育	0.98	0.97	0.97	1500
时尚	0.99	0.98	0.99	1500
时政	0.97	0.98	0.98	1500
游戏	1.00	0.99	0.99	1500
科技	0.96	0.97	0.97	1500
avg / total	0.98	0.98	0.98	15000
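As a sanity check on the report above, the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it for three rows from the rounded table values; f1_from is a hypothetical helper, and because the report itself is computed from unrounded counts, a recomputed value can differ from the table in the last digit for some rows.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) copied from the report above.
rows = {
    "财经": (0.97, 0.96),
    "房产": (0.96, 0.97),
    "教育": (0.98, 0.97),
}

for label, (p, r) in rows.items():
    print(label, round(f1_from(p, r), 2))
```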


training figure

Tips

When converting the TensorFlow checkpoint to PyTorch, choose "bert_model.ckpt" as the input file, not "bert_model.ckpt.index". Otherwise the model learns nothing and gives nearly identical random outputs for any input, which means the real checkpoint weights were never loaded.
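One way to guard against passing the wrong file is to normalize the path before conversion. This is a minimal sketch, assuming the usual TensorFlow checkpoint layout in which "bert_model.ckpt" is a prefix shared by .index, .meta, and .data-XXXXX-of-XXXXX files; checkpoint_prefix is a hypothetical helper, not part of any conversion script.

```python
import re

# Suffixes of the individual files that make up one TensorFlow checkpoint.
_CKPT_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Return the checkpoint prefix, stripping a per-file suffix if present."""
    return _CKPT_SUFFIX.sub("", path)


print(checkpoint_prefix("bert_model.ckpt.index"))
print(checkpoint_prefix("bert_model.ckpt"))
```

Both calls print bert_model.ckpt, which is the value to hand to the conversion step.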

When using multiple GPUs, non-tensor calculations such as accuracy and f1_score are not supported by a DataParallel instance.

As recommended by Jacob Devlin et al. in the BERT paper (https://arxiv.org/pdf/1810.04805.pdf), the hyperparameters for fine-tuning tasks should be set as follows: Batch_size: 16 or 32; learning_rate: 5e-5, 3e-5, or 2e-5; num_train_epoch: 3 or 4.
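The recommended ranges are small enough to enumerate exhaustively. The sketch below builds the full grid; the dictionary keys and the function name fine_tuning_grid are illustrative assumptions and do not correspond to the option names in pybert/config/base.py.

```python
from itertools import product

# Values taken from the recommendation above.
BATCH_SIZES = (16, 32)
LEARNING_RATES = (5e-5, 3e-5, 2e-5)
NUM_EPOCHS = (3, 4)


def fine_tuning_grid():
    """Return every combination of the recommended hyperparameters."""
    return [
        {"batch_size": b, "learning_rate": lr, "num_train_epochs": e}
        for b, lr, e in product(BATCH_SIZES, LEARNING_RATES, NUM_EPOCHS)
    ]


grid = fine_tuning_grid()
print(len(grid))  # 2 * 3 * 2 = 12 configurations
```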

The pretrained model limits the input length to at most 512 tokens, the maximum position embedding size. The data flows into the model as: Raw_data -> WordPieces -> Model. Because the WordPiece sequence is generally longer than the raw text, a safe maximum length for the raw data is about 128 to 256.
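The length limit can be enforced after tokenization and before the sequence reaches the model. This is a minimal sketch, assuming the WordPieces are already available as a list of strings and that two positions are reserved for the [CLS] and [SEP] markers; build_input_tokens and its default of 256 are illustrative, not the project's own processor.

```python
MAX_POSITION_EMBEDDINGS = 512  # upper bound imposed by the pretrained model


def build_input_tokens(word_pieces, max_seq_len=256):
    """Truncate WordPieces so the final sequence fits in max_seq_len tokens."""
    if not 2 < max_seq_len <= MAX_POSITION_EMBEDDINGS:
        raise ValueError("max_seq_len must be in the range 3..512")
    # Reserve two positions for the [CLS] and [SEP] markers.
    body = list(word_pieces)[: max_seq_len - 2]
    return ["[CLS]"] + body + ["[SEP]"]


pieces = list("今天股市大涨") * 100  # 600 single-character pieces
tokens = build_input_tokens(pieces, max_seq_len=512)
print(len(tokens))  # 512
```

Truncating from the end is the simplest policy; for long documents it discards the tail, so a different strategy may suit data where the label depends on later text.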

In our tests, fine-tuning all layers gave much better results than fine-tuning only the last classifier layer. The latter is effectively a feature-based approach.

