torchgpipe
A GPipe implementation in PyTorch. It is optimized for CUDA rather than TPU.

    from torch import nn
    from torchgpipe import GPipe

    model = nn.Sequential(a, b, c, d)
    model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)
    output = model(input)

GPipe is a scalable pipeline parallelism library published by Google Brain, which allows efficient training of large, memory-consuming models. According to the paper, GPipe can train a 25x larger model by using 8x devices (TPU), and train a model 3.5x faster by using 4x devices.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Google trained AmoebaNet-B with 557M parameters using GPipe. This model achieved 84.3% top-1 and 97.0% top-5 accuracy on the ImageNet classification benchmark (the state-of-the-art performance as of May 2019).
GPipe uses (a) pipeline parallelism and (b) automatic recomputation of the forward propagation during backpropagation; together these make it possible to train a large model. We refer to (b) as checkpointing, following the well-known terminology in the PyTorch community.
Pipeline Parallelism
GPipe splits a model into multiple partitions and places each partition on a different device, so that the model can use the combined memory of all devices. It also splits each mini-batch into multiple micro-batches so that the partitions work in parallel as much as possible.
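The overlap between partitions can be sketched as a simple schedule. The following is a minimal illustration, not torchgpipe's actual scheduler: the function name `clock_cycles` and its parameters are hypothetical, and it only assumes that partition `j` can start micro-batch `i` once partition `j - 1` has finished it.

```python
from typing import Iterator, List, Tuple


def clock_cycles(num_microbatches: int, num_partitions: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield, for each clock tick, the (micro-batch, partition) pairs that can run in parallel."""
    m, n = num_microbatches, num_partitions
    for tick in range(m + n - 1):
        # At a given tick, partition j works on micro-batch (tick - j),
        # provided that micro-batch index is within [0, m).
        yield [(tick - j, j) for j in range(max(0, tick - m + 1), min(tick + 1, n))]


# Example: 4 micro-batches flowing through 3 partitions.
for tick, tasks in enumerate(clock_cycles(num_microbatches=4, num_partitions=3)):
    print(f"tick {tick}: {tasks}")
```

With a single micro-batch, only one partition is busy at any tick. With more micro-batches, the middle ticks keep every partition busy, which is why a larger number of micro-batches improves device utilization.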
Checkpointing
Checkpointing is applied to each partition to minimize the overall memory consumption of a model. During forward propagation, only the tensors at the boundaries between partitions are kept. All other intermediate tensors are discarded and recomputed during backpropagation when necessary.
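The idea can be illustrated with a toy example on scalar functions. This is a sketch under simplifying assumptions, not torchgpipe's implementation: each "layer" is a hypothetical pair of a function and its derivative, and the helper names are invented for illustration. PyTorch itself provides a general-purpose version of this technique in `torch.utils.checkpoint`.

```python
from typing import Callable, List, Tuple

# A toy layer: (forward function, derivative of the forward function).
Layer = Tuple[Callable[[float], float], Callable[[float], float]]


def forward_with_checkpoints(partitions: List[List[Layer]], x: float) -> Tuple[float, List[float]]:
    """Run all partitions, remembering only the input at each partition boundary."""
    boundaries = []
    for partition in partitions:
        boundaries.append(x)  # the only value kept for this partition
        for f, _ in partition:
            x = f(x)  # intermediate values are not stored
    return x, boundaries


def backward_with_recompute(partitions: List[List[Layer]], boundaries: List[float],
                            grad_output: float = 1.0) -> float:
    """Backpropagate, recomputing each partition's intermediates from its boundary input."""
    grad = grad_output
    for partition, x in zip(reversed(partitions), reversed(boundaries)):
        # Recompute the input of every layer in this partition.
        inputs = []
        for f, _ in partition:
            inputs.append(x)
            x = f(x)
        # Apply the chain rule through the partition in reverse order.
        for (_, df), layer_input in zip(reversed(partition), reversed(inputs)):
            grad *= df(layer_input)
    return grad


double: Layer = (lambda x: 2.0 * x, lambda x: 2.0)
square: Layer = (lambda x: x * x, lambda x: 2.0 * x)
add_one: Layer = (lambda x: x + 1.0, lambda x: 1.0)

# Two partitions with two layers each: y = ((2x)^2 + 1)^2.
partitions = [[double, square], [add_one, square]]
output, boundaries = forward_with_checkpoints(partitions, 1.5)
gradient = backward_with_recompute(partitions, boundaries)
print(output, boundaries, gradient)
```

Only two values are stored during the forward pass (one per partition) instead of four (one per layer); the price is that each partition's forward computation runs a second time during the backward pass.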
Currently, torchgpipe requires the following environment:
Python 3.6+
PyTorch 1.1+
To use torchgpipe, install it via PyPI:
$ pip install torchgpipe
To train a module with GPipe, simply wrap it with `torchgpipe.GPipe`. Your module must be an `nn.Sequential`, because GPipe automatically splits the module into partitions of consecutive layers. The `balance` argument determines the number of layers in each partition, and the `chunks` argument specifies the number of micro-batches. Input, output, and intermediate tensors must be `Tensor` or `Tuple[Tensor, ...]`.
The example below shows how to split a module with four layers into four partitions, each with a single layer. It also splits each mini-batch into 8 micro-batches:

    from torch import nn
    from torchgpipe import GPipe

    # a, b, c, d are the four layers of the model.
    model = nn.Sequential(a, b, c, d)
    model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)

    for input in data_loader:
        output = model(input)

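To make the meaning of `balance` concrete, the sketch below splits a plain list of layers into consecutive groups. It is only an illustration of the semantics described above: `split_by_balance` is a hypothetical helper, not part of the torchgpipe API, and the validation it performs is an assumption about reasonable behavior rather than a description of what the library does.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_by_balance(layers: Sequence[T], balance: Sequence[int]) -> List[List[T]]:
    """Split `layers` into consecutive partitions whose sizes are given by `balance`."""
    if any(size <= 0 for size in balance):
        raise ValueError("every partition must contain at least one layer")
    if sum(balance) != len(layers):
        raise ValueError("sum of balance must equal the number of layers")

    partitions = []
    start = 0
    for size in balance:
        # Each partition takes the next `size` consecutive layers.
        partitions.append(list(layers[start:start + size]))
        start += size
    return partitions


layers = ["a", "b", "c", "d"]
print(split_by_balance(layers, [1, 1, 1, 1]))  # four partitions, one layer each
print(split_by_balance(layers, [3, 1]))        # two partitions, uneven sizes
```

An uneven balance such as `[3, 1]` is useful when layers differ in cost, so that each device ends up with a similar amount of work or memory.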
Visit torchgpipe.readthedocs.io for more information including the API references.
Experiment | torchgpipe | GPipe (original) |
---|---|---|
naive-1 | 1x | 1x |
pipeline-1 | 0.756x | 0.8x |
pipeline-2 | 1.489x | 1.418x |
pipeline-4 | 2.629x | 2.182x |
pipeline-8 | 4.367x | 2.891x |
The table shows the reproduced speed benchmark on ResNet-101, as reported in Figure 3(b) of the paper.
Naive-1 is the baseline setting in which ResNet-101 is trained on a single device without GPipe. The speeds under the other settings are measured relative to the speed of naive-1 (which is taken as the unit speed). Pipeline-k means k partitions with GPipe using k devices. Pipeline-1 is slower than naive-1 because it does not benefit from pipeline parallelism but still pays the checkpointing overhead.
The reproducible code can be found in examples/resnet101_speed_benchmark.
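One way to read the ResNet-101 speed table is to divide each relative speed by the number of devices used. The snippet below does this for the torchgpipe column; the numbers are copied from the table, while the helper name `per_device_speed` is hypothetical and the per-device figure is a derived quantity, not something reported in the paper.

```python
from typing import Dict

# Relative speeds of pipeline-k from the torchgpipe column (naive-1 = 1x).
torchgpipe_speed: Dict[int, float] = {1: 0.756, 2: 1.489, 4: 2.629, 8: 4.367}


def per_device_speed(relative_speed: Dict[int, float]) -> Dict[int, float]:
    """Divide the relative speed of each pipeline-k setting by its device count k."""
    return {devices: speed / devices for devices, speed in relative_speed.items()}


for devices, speed in per_device_speed(torchgpipe_speed).items():
    print(f"pipeline-{devices}: {speed:.3f}x per device")
```

The per-device figure falls as devices are added, which is expected: part of every pipeline run is spent filling and draining the pipeline, and checkpointing adds recomputation on every device.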
Batch size | torchgpipe | nn.DataParallel | Goyal et al. |
---|---|---|---|
256 | 21.99±0.13 | 22.02±0.11 | 22.08±0.06 |
1k | 22.24±0.19 | 22.04±0.24 | N/A |
4k | 22.13±0.09 | N/A | N/A |
The table shows the reproduced accuracy (top-1 error rate) benchmark on ResNet-101, as reported in Table 2(c) of the Accurate, Large Minibatch SGD paper.
The reproducible code can be found in examples/resnet101_accuracy_benchmark.
Experiment | torchgpipe | GPipe (original) |
---|---|---|
naive-2 | 1x | 1x |
pipeline-2 | 1.434x | 1.156x |
pipeline-4 | 2.049x | 2.483x |
pipeline-8 | 2.424x | 3.442x |
The table shows the reproduced speed benchmark on AmoebaNet-D, as reported in Figure 3(a) of the paper. The results of torchgpipe and GPipe differ somewhat. We believe this difference is caused not by differences between torchgpipe and GPipe themselves, but by our reimplementation in PyTorch of the AmoebaNet-D model originally written in TensorFlow. Results will be updated whenever a stable and reproducible AmoebaNet-D in PyTorch is available.
Naive-2 is the baseline setting in which AmoebaNet-D is trained on two devices without GPipe. Pipeline-2 is a little faster than reported in the paper, but pipeline-4 and pipeline-8 are slower.
Experiment | Implementation | AmoebaNet-D (L, F) | # of Model Parameters | Total Peak Model Parameter Memory | Total Peak Activation Memory |
---|---|---|---|---|---|
naive-1 | torchgpipe | (6, 208) | 90M | 1.00GB | - |
naive-1 | GPipe (original) | (6, 208) | 82M | 1.05GB | 6.26GB |
pipeline-1 | torchgpipe | (6, 416) | 358M | 4.01GB | 6.64GB |
pipeline-1 | GPipe (original) | (6, 416) | 318M | 3.80GB | 3.46GB |
pipeline-2 | torchgpipe | (6, 544) | 613M | 6.45GB | 11.31GB |
pipeline-2 | GPipe (original) | (6, 544) | 542M | 6.45GB | 8.11GB |
pipeline-4 | torchgpipe | (12, 544) | 1.16B | 13.00GB | 18.72GB |
pipeline-4 | GPipe (original) | (12, 544) | 1.05B | 12.53GB | 15.21GB |
pipeline-8 | torchgpipe | (24, 512) | 2.01B | 22.42GB | 35.78GB |
pipeline-8 | GPipe (original) | (24, 512) | 1.80B | 24.62GB | 26.24GB |
The table shows the improved memory utilization of AmoebaNet-D with GPipe, as reported in Table 1 of the paper. The size of an AmoebaNet-D model is determined by two hyperparameters, L and F, which are proportional to the number of layers and filters, respectively.
The difference between naive-1 and pipeline-1 shows that GPipe makes it possible to train a larger model even on a single device. With 8 GPUs, GPipe can train a model about 22 times larger than in the naive-1 setting.
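The "22 times larger" figure follows directly from the parameter counts in the memory table. The snippet below redoes that arithmetic; `parse_param_count` is a hypothetical helper written for this example, and the inputs are the rounded values shown in the table, so the ratios are approximate.

```python
def parse_param_count(text: str) -> float:
    """Convert a count such as '90M' or '2.01B' into a number of parameters."""
    units = {"M": 1e6, "B": 1e9}
    return float(text[:-1]) * units[text[-1]]


# pipeline-8 versus naive-1, using the values from the memory table.
torchgpipe_ratio = parse_param_count("2.01B") / parse_param_count("90M")
gpipe_ratio = parse_param_count("1.80B") / parse_param_count("82M")

print(f"torchgpipe: {torchgpipe_ratio:.1f}x larger")
print(f"GPipe (original): {gpipe_ratio:.1f}x larger")
```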
This project is functional, but the interface is not finalized yet. All public APIs are subject to change without warning until v0.1.0.
The torchgpipe project is developed by Heungsub Lee, Myungryong Jeong, and Chiheon Kim at Kakao Brain, with help from Sungbin Lim, Ildoo Kim, and Woonhyuk Baek. It is distributed under the Apache License 2.0.
If you apply this library to any project or research, please cite our code:

    @misc{torchgpipe,
      author       = {Kakao Brain},
      title        = {torchgpipe, {A} {GPipe} implementation in {PyTorch}},
      howpublished = {\url{https://github.com/kakaobrain/torchgpipe}},
      year         = {2019}
    }