bert-multi-gpu
Feel free to fine-tune large BERT models with large batch sizes easily. Multi-GPU and FP16 are supported.
TensorFlow
tensorflow >= 1.11.0  # CPU version of TensorFlow.
tensorflow-gpu >= 1.11.0  # GPU version of TensorFlow. (Upgrade to 1.14.0 if you hit "ImportError: No module named 'tensorflow.python.distribute.cross_device_ops'".)
NVIDIA Collective Communications Library (NCCL)
CPU/GPU/TPU Support
Multi-GPU Support: tf.distribute.MirroredStrategy is used to achieve multi-GPU support for this project. It mirrors variables across multiple devices and machines for synchronous training. The maximum batch_size for each GPU is almost the same as in single-GPU bert, so the global batch_size depends on how many GPUs there are (a minimal usage sketch follows the example below).
Assume: num_train_examples = 32000

Situation 1 (multi-gpu): train_batch_size = 8, num_gpu_cores = 4, num_train_epochs = 1
global_batch_size = train_batch_size * num_gpu_cores = 32
iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000

Situation 2 (single-gpu): train_batch_size = 32, num_gpu_cores = 1, num_train_epochs = 4
global_batch_size = train_batch_size * num_gpu_cores = 32
iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000

The result after training is equivalent between situations 1 and 2 when synchronous gradient updates are applied.
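As a minimal sketch (not this repo's exact code; model_fn, train_input_fn, and the model_dir path are placeholders), MirroredStrategy can be attached to a TF 1.x Estimator through RunConfig:

import tensorflow as tf

# Mirrors all variables to every visible GPU; gradients are all-reduced
# (e.g. via NCCL) and applied synchronously on each replica.
strategy = tf.distribute.MirroredStrategy()

run_config = tf.estimator.RunConfig(
    model_dir="/tmp/bert_output",   # placeholder path
    train_distribute=strategy)      # each training step runs on all replicas

# estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)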
FP16 Support: FP16 allows you to use a larger batch_size, and training speed increases by roughly 70~100% on Volta GPUs, but it may be slower on Pascal GPUs (REF1, REF2).
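How FP16 is wired in varies; one common TF 1.x pattern, shown as an illustrative sketch below (not necessarily this project's exact implementation), keeps float32 master weights and casts them to float16 for compute via a custom variable getter:

import tensorflow as tf

def float32_variable_storage_getter(getter, name, shape=None, dtype=None,
                                    initializer=None, regularizer=None,
                                    trainable=True, *args, **kwargs):
    """Creates variables in float32, then casts them to the requested dtype."""
    storage_dtype = tf.float32 if dtype in (tf.float16, tf.float32) else dtype
    variable = getter(name, shape, dtype=storage_dtype,
                      initializer=initializer, regularizer=regularizer,
                      trainable=trainable, *args, **kwargs)
    if dtype == tf.float16:
        # The cast happens in the graph, so the optimizer still updates
        # the underlying float32 master weights.
        variable = tf.cast(variable, tf.float16)
    return variable

# Usage (illustrative):
# with tf.variable_scope("bert", custom_getter=float32_variable_storage_getter):
#     model = build_model(features, compute_dtype=tf.float16)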
SavedModel Export
Some optional parameters are listed below:

task_name: The name of the task to fine-tune. You can define your own task by implementing a DataProcessor class.
do_lower_case: Whether to lower-case the input text. Should be true for uncased models and false for cased models. Default value is true.
do_train: Whether to fine-tune the classifier. Default value is false.
do_eval: Whether to evaluate the classifier. Default value is false.
do_predict: Whether to run prediction with the classifier recovered from a checkpoint. Default value is false.
save_for_serving: Whether to output a SavedModel for TensorFlow Serving. Default value is false.
data_dir: Your original input data directory.
vocab_file, bert_config_file, init_checkpoint: Files in the BERT model directory.
max_seq_length: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Default value is 128.
train_batch_size: Batch size for each GPU. For example, if train_batch_size is 16 and num_gpu_cores is 4, your GLOBAL batch size is 16 * 4 = 64.
learning_rate: Initial learning rate for the Adam optimizer.
num_train_epochs: Number of training epochs.
use_gpu: Whether to use GPU.
num_gpu_cores: Total number of GPU cores to use; only used if use_gpu is true.
use_fp16: Whether to use FP16.
output_dir: Checkpoints and SavedModel (.pb) files will be saved in this directory.
python run_custom_classifier.py \
  --task_name=QQP \
  --do_lower_case=true \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --save_for_serving=true \
  --data_dir=/cfs/data/glue/QQP \
  --vocab_file=/cfs/models/bert-large-uncased/vocab.txt \
  --bert_config_file=/cfs/models/bert-large-uncased/bert_config.json \
  --init_checkpoint=/cfs/models/bert-large-uncased/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --use_gpu=true \
  --num_gpu_cores=4 \
  --use_fp16=false \
  --output_dir=/cfs/outputs/bert-large-uncased-qqp
A shell script is also available (see run_custom_classifier.sh). Optional parameters can be passed flexibly on the command line, and CUDA_VISIBLE_DEVICES can be set and exported as an environment variable when multiple GPUs are used.
# refer to the variable acronyms
bash run_custom_classifier.sh -h
# output
current params setting:
  -s max_seq_length, default val is: 128
  -g num_gpu_cores, default val is: 4
  -b train_batch_size, default val is: 32
  -l learning_rate, default val is: 2e-5
  -e num_train_epochs, default val is: 3.0
  -c CUDA_VISIBLE_DEVICES, default val is: 0,1,2,3
# example to pass params
bash run_custom_classifier.sh -s 512 -b 8 -l 3e-5 -e 1 -g 2 -c 2,3
Use case: in some situations one example can be assigned to several groups at once; e.g., one movie can be tagged as romantic, commercial, and boring under different aspects. Multi-label classification should therefore be applied rather than multi-class classification, because the labels are not exclusive (e.g., [1, 1, 0]).
One additional parameter, num_labels, is required; the other parameters are similar to the basic classifier's (a sketch of a typical multi-label head follows the example command below).
python run_custom_classifier_mlabel.py \
  --num_labels=10 \
  --task_name=Mlabel \
  --do_lower_case=true \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --save_for_serving=true \
  --data_dir=/cfs/data/Mlabel \
  --vocab_file=/cfs/models/bert-large-uncased/vocab.txt \
  --bert_config_file=/cfs/models/bert-large-uncased/bert_config.json \
  --init_checkpoint=/cfs/models/bert-large-uncased/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --use_gpu=true \
  --num_gpu_cores=4 \
  --use_fp16=false \
  --output_dir=/cfs/outputs/bert-large-uncased-mlabel
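For intuition, a multi-label head usually replaces the single softmax with an independent sigmoid per class. The sketch below is illustrative only (function and variable names are placeholders, not this repo's exact code):

import tensorflow as tf

def multi_label_loss(logits, label_ids, num_labels):
    # label_ids are multi-hot, e.g. [1, 1, 0, ...]; each class gets an
    # independent sigmoid instead of one softmax over exclusive classes.
    labels = tf.cast(label_ids, tf.float32)
    per_label_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=logits)            # [batch, num_labels]
    loss = tf.reduce_mean(tf.reduce_sum(per_label_loss, axis=-1))
    probabilities = tf.nn.sigmoid(logits)        # threshold (e.g. 0.5) at predict time
    return loss, probabilities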
Some optional parameters are listed below:

task_name: The name of the task to fine-tune. You can define your own task by implementing a DataProcessor class.
do_lower_case: Whether to lower-case the input text. Should be true for uncased models and false for cased models. Default value is true.
do_train: Whether to fine-tune the model. Default value is false.
do_eval: Whether to evaluate the model. Default value is false.
do_predict: Whether to run prediction with the model recovered from a checkpoint. Default value is false.
save_for_serving: Whether to output a SavedModel for TensorFlow Serving. Default value is false.
data_dir: Your original input data directory.
vocab_file, bert_config_file, init_checkpoint: Files in the BERT model directory.
max_seq_length: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Default value is 128.
train_batch_size: Batch size for each GPU. For example, if train_batch_size is 16 and num_gpu_cores is 4, your GLOBAL batch size is 16 * 4 = 64.
learning_rate: Initial learning rate for the Adam optimizer.
num_train_epochs: Number of training epochs.
use_gpu: Whether to use GPU.
num_gpu_cores: Total number of GPU cores to use; only used if use_gpu is true.
use_fp16: Whether to use FP16.
output_dir: Checkpoints and SavedModel (.pb) files will be saved in this directory.
python run_seq_labeling.py \
  --task_name=PUNCT \
  --do_lower_case=true \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --save_for_serving=true \
  --data_dir=/cfs/data/PUNCT \
  --vocab_file=/cfs/models/bert-large-uncased/vocab.txt \
  --bert_config_file=/cfs/models/bert-large-uncased/bert_config.json \
  --init_checkpoint=/cfs/models/bert-large-uncased/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=10.0 \
  --use_gpu=true \
  --num_gpu_cores=4 \
  --use_fp16=false \
  --output_dir=/cfs/outputs/bert-large-uncased-punct
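For background, sequence labeling attaches a classifier to every token position rather than to the pooled [CLS] vector. The following is a hedged sketch of such a head (illustrative names, not necessarily this script's exact implementation):

import tensorflow as tf

def token_classification_loss(sequence_output, label_ids, input_mask, num_labels):
    # sequence_output: [batch, seq_len, hidden] from BERT's final layer.
    logits = tf.layers.dense(sequence_output, num_labels)  # [batch, seq_len, num_labels]
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    one_hot = tf.one_hot(label_ids, depth=num_labels, dtype=tf.float32)
    per_token_loss = -tf.reduce_sum(one_hot * log_probs, axis=-1)  # [batch, seq_len]
    mask = tf.cast(input_mask, tf.float32)                 # ignore padding positions
    loss = tf.reduce_sum(per_token_loss * mask) / (tf.reduce_sum(mask) + 1e-5)
    return loss, logits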
You can define your own task data processor by implementing the DataProcessor class. Then, add your CustomProcessor to processors. Finally, you can pass --task_name=your_task_name to the Python script.
# Create custom task data processor in run_custom_classifier.py
class CustomProcessor(DataProcessor):
    """Processor for the Custom data set."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(read_custom_train_lines(data_dir), 'train')

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(read_custom_dev_lines(data_dir), 'dev')

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(read_custom_test_lines(data_dir), 'test')

    def get_labels(self):
        """See base class."""
        return your_label_list  # ["label-1", "label-2", "label-3", ..., "label-k"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training/evaluation/testing sets."""
        examples = []
        for (i, line) in enumerate(lines):
            # text_b can be None
            (guid, text_a, text_b, label) = parse_your_data_line(line)
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


# Add CustomProcessor to processors in run_custom_classifier.py
def main(_):
    # ...
    # Register the 'custom' processor name to processors, and you can pass
    # --task_name=custom to this script.
    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "qqp": QqpProcessor,
        "custom": CustomProcessor,
    }
    # ...
If --save_for_serving=true is passed to run_custom_classifier.py or run_seq_labeling.py, the script will export a SavedModel file to output_dir. Now you are good to go.
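Under the hood, an Estimator SavedModel export goes through a serving input receiver. The sketch below is a plausible shape for it, with feature names matching the exported signature shown in the next step (illustrative only, not this repo's exact code):

import tensorflow as tf

def serving_input_receiver_fn():
    # Placeholders named to match the exported signature (input_ids:0, ...).
    features = {
        "input_ids": tf.placeholder(tf.int32, [None, 128], name="input_ids"),
        "input_mask": tf.placeholder(tf.int32, [None, 128], name="input_mask"),
        "segment_ids": tf.placeholder(tf.int32, [None, 128], name="segment_ids"),
        "label_ids": tf.placeholder(tf.int32, [None], name="label_ids"),
    }
    return tf.estimator.export.ServingInputReceiver(features, features)

# estimator.export_savedmodel(output_dir, serving_input_receiver_fn)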
Install the SavedModel CLI by installing a pre-built TensorFlow binary (usually already installed on your system at bin/saved_model_cli) or by building TensorFlow from source.
Check your SavedModel file:
saved_model_cli show --dir <bert_savedmodel_output_path>/<timestamp> --all

# For example:
saved_model_cli show --dir tf_serving/bert_base_uncased_multi_gpu_qqp/1557722227/ --all

# Output:
# signature_def['serving_default']:
#   The given SavedModel SignatureDef contains the following input(s):
#     inputs['input_ids'] tensor_info:
#       dtype: DT_INT32
#       shape: (-1, 128)
#       name: input_ids:0
#     inputs['input_mask'] tensor_info:
#       dtype: DT_INT32
#       shape: (-1, 128)
#       name: input_mask:0
#     inputs['label_ids'] tensor_info:
#       dtype: DT_INT32
#       shape: (-1)
#       name: label_ids:0
#     inputs['segment_ids'] tensor_info:
#       dtype: DT_INT32
#       shape: (-1, 128)
#       name: segment_ids:0
#   The given SavedModel SignatureDef contains the following output(s):
#     outputs['probabilities'] tensor_info:
#       dtype: DT_FLOAT
#       shape: (-1, 2)
#       name: loss/Softmax:0
#   Method name is: tensorflow/serving/predict
Install Bazel and compile tensorflow_model_server.
cd /your/path/to/tensorflow/serving
bazel build -c opt //tensorflow_serving/model_servers:tensorflow_model_server
Start TensorFlow Serving to listen on a port for the HTTP/REST API or the gRPC API; tensorflow_model_server will initialize the models in <bert_savedmodel_output_path>.
# HTTP/REST API
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --rest_api_port=<rest_api_port> --model_name=<model_name> --model_base_path=<bert_savedmodel_output_path>

# For example:
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --rest_api_port=9000 --model_name=bert_base_uncased_qqp --model_base_path=/root/tf_serving/bert_base_uncased_multi_gpu_qqp --enable_batching=true

# Output:
# 2019-05-14 23:26:38.135575: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: bert_base_uncased_qqp version: 1557722227}
# 2019-05-14 23:26:38.158674: I tensorflow_serving/model_servers/server.cc:324] Running gRPC ModelServer at 0.0.0.0:8500 ...
# 2019-05-14 23:26:38.179164: I tensorflow_serving/model_servers/server.cc:344] Exporting HTTP/REST API at:localhost:9000 ...
Make a request to test your latest serving model.
curl -H "Content-type: application/json" -X POST -d '{"instances": [{"input_ids": [101,2054,2064,2028,2079,2044,16914,5910,1029,102,2054,2079,1045,2079,2044,2026,16914,5910,1029,102,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], "input_mask": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], "segment_ids": [0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], "label_ids":[0]}]}' "http://localhost:9000/v1/models/bert_base_uncased_qqp:predict"# Output:# {"predictions": [[0.608512461, 0.391487628]]}
License
Apache License. See the license file in the repository for more details.