BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for Chinese text classification.
At the root of the project, you will see:
├── pybert
|  └── callback
|  |  └── lrscheduler.py
|  |  └── trainingmonitor.py
|  |  └── ...
|  └── config
|  |  └── base.py # a configuration file for storing model parameters
|  └── dataset
|  └── io
|  |  └── bert_processor.py
|  └── model
|  |  └── nn
|  |  └── pretrain
|  └── output # saves the output of the model
|  └── preprocessing # text preprocessing
|  └── train # used for training a model
|  |  └── trainer.py
|  |  └── ...
|  └── utils # a set of utility functions
├── run_bert.py
csv
tqdm
numpy
pickle
scikit-learn
PyTorch 1.0
matplotlib
pytorch_transformers==1.1.0
You need to download the pretrained Chinese BERT model:
Download the Bert pretrained model from s3
Download the Bert config file from s3
Download the Bert vocab file from s3
Rename bert-base-chinese-pytorch_model.bin to pytorch_model.bin, bert-base-chinese-config.json to config.json, and bert-base-chinese-vocab.txt to vocab.txt.
Place the model, config, and vocab files into the /pybert/pretrain/bert/base-uncased directory.
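The rename-and-place step can also be scripted. The following is a minimal sketch, not part of the repo: the function name install_pretrained and the idea of a separate download folder are assumptions made for illustration, and only the three file names come from the steps above.

```python
import shutil
from pathlib import Path

# Downloaded file name -> name expected in the pretrained model directory.
RENAMES = {
    "bert-base-chinese-pytorch_model.bin": "pytorch_model.bin",
    "bert-base-chinese-config.json": "config.json",
    "bert-base-chinese-vocab.txt": "vocab.txt",
}


def install_pretrained(download_dir, target_dir):
    """Copy the three downloaded files into target_dir under their new names.

    Hypothetical helper: both directories are supplied by the caller.
    """
    download_dir, target_dir = Path(download_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src_name, dst_name in RENAMES.items():
        src = download_dir / src_name
        if not src.is_file():
            raise FileNotFoundError(f"missing download: {src}")
        shutil.copyfile(src, target_dir / dst_name)
        copied.append(dst_name)
    return copied
```

For example, install_pretrained("downloads", "pybert/pretrain/bert/base-uncased") would copy the files from a hypothetical downloads folder, assuming the command is run from the project root.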
Install pytorch-transformers with pip: pip install pytorch-transformers (the project is hosted on GitHub).
Prepare the data: download it from BaiduNet (password: ruxu). You can modify io.bert_processor.py to adapt the code to your own data.
Modify the configuration in pybert/config/base.py (the data paths, ...).
Run python run_bert.py --do_data to preprocess the data.
Run python run_bert.py --do_train --save_best to fine-tune the BERT model.
Run python run_bert.py --do_test --do_lower_case to predict on new data.
Epoch: 3 - loss: 0.0222 acc: 0.9939 - f1: 0.9911 val_loss: 0.0785 - val_acc: 0.9799 - val_f1: 0.9800
label | precision | recall | f1-score | support |
---|---|---|---|---|
财经 (finance) | 0.97 | 0.96 | 0.96 | 1500 |
体育 (sports) | 1.00 | 1.00 | 1.00 | 1500 |
娱乐 (entertainment) | 0.99 | 0.99 | 0.99 | 1500 |
家居 (home) | 0.99 | 0.99 | 0.99 | 1500 |
房产 (real estate) | 0.96 | 0.97 | 0.96 | 1500 |
教育 (education) | 0.98 | 0.97 | 0.97 | 1500 |
时尚 (fashion) | 0.99 | 0.98 | 0.99 | 1500 |
时政 (politics) | 0.97 | 0.98 | 0.98 | 1500 |
游戏 (games) | 1.00 | 0.99 | 0.99 | 1500 |
科技 (technology) | 0.96 | 0.97 | 0.97 | 1500 |
avg / total | 0.98 | 0.98 | 0.98 | 15000 |
When converting the TensorFlow checkpoint to PyTorch, choose "bert_model.ckpt" rather than "bert_model.ckpt.index" as the input file. Otherwise, the model learns nothing and gives almost the same random output for any input, which means the real checkpoint weights were never loaded.
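One way to guard against passing the wrong file is to normalize the path before conversion. This is a small sketch under the assumption that the checkpoint uses TensorFlow's usual on-disk layout (a prefix plus .index, .meta, and .data-NNNNN-of-NNNNN files); the helper name checkpoint_prefix is made up for illustration and is not part of this repo or of any converter.

```python
import re

# Suffixes that TensorFlow appends to a checkpoint prefix on disk.
_CKPT_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Return the checkpoint prefix (e.g. 'bert_model.ckpt') for any of its files.

    A path that is already a prefix is returned unchanged.
    """
    return _CKPT_SUFFIX.sub("", path)
```

The returned prefix, not the .index file, is what should be handed to the conversion script.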
When using multiple GPUs, non-tensor calculations such as accuracy and f1_score are not supported by a DataParallel instance.
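A common workaround, which may or may not match what this repo does, is to let the wrapped model return only logits and to compute the metrics afterwards on the gathered outputs. The sketch below uses plain Python lists as stand-ins for those gathered logits and labels; the function names and the choice of macro-averaged F1 are assumptions for illustration.

```python
def argmax(row):
    """Index of the largest value in one row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Stand-in for logits gathered from all GPUs after the forward pass.
logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
y_true = [1, 0, 0, 0]
y_pred = [argmax(row) for row in logits]
```

Because the metric functions only see ordinary Python values, they run once on the host and never pass through the DataParallel forward call.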
As recommended by Jacob Devlin et al. in the BERT paper (https://arxiv.org/pdf/1810.04805.pdf), the hyperparameters for fine-tuning tasks should be set as follows: Batch_size: 16 or 32; learning_rate: 5e-5, 3e-5, or 2e-5; num_train_epoch: 3 or 4.
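The recommended values form a small search grid. As an illustrative sketch (the dictionary keys and the function name fine_tuning_grid are assumptions, not configuration names used in pybert/config/base.py), the combinations can be enumerated like this:

```python
from itertools import product

# Values taken from the recommendation above.
BATCH_SIZES = (16, 32)
LEARNING_RATES = (5e-5, 3e-5, 2e-5)
NUM_EPOCHS = (3, 4)


def fine_tuning_grid():
    """Return every combination of the recommended hyperparameters."""
    return [
        {"batch_size": b, "learning_rate": lr, "num_train_epochs": e}
        for b, lr, e in product(BATCH_SIZES, LEARNING_RATES, NUM_EPOCHS)
    ]
```

This yields 2 x 3 x 2 = 12 candidate settings, few enough to try exhaustively on a small validation set.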
The pretrained model limits the input sequence length to at most 512, the size of the max position embedding. The data flows into the model as: Raw_data -> WordPieces -> Model. Note that the WordPiece sequence is generally longer than the raw_data, so a safe max length for raw_data is about 128 to 256.
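The length limit is usually enforced by truncating the WordPiece sequence before adding BERT's special tokens. Here is a minimal sketch, assuming a single-sentence input framed by [CLS] and [SEP]; the function name build_input and the default of 256 are illustrative choices, not taken from this repo.

```python
MAX_POSITION_EMBEDDINGS = 512  # hard limit of the pretrained model


def build_input(tokens, max_seq_len=256):
    """Truncate WordPiece tokens so [CLS] + tokens + [SEP] fits max_seq_len."""
    if not 2 < max_seq_len <= MAX_POSITION_EMBEDDINGS:
        raise ValueError("max_seq_len must be in the range 3..512")
    budget = max_seq_len - 2  # reserve two slots for [CLS] and [SEP]
    return ["[CLS]"] + list(tokens[:budget]) + ["[SEP]"]
```

Reserving the two slots up front keeps the final sequence within the limit regardless of how long the raw text was.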
In our tests, fine-tuning all layers gave much better results than fine-tuning only the last classifier layer. The latter is effectively a feature-based approach.