torchgpipe

2019-10-10 |


A GPipe implementation in PyTorch. It is optimized for CUDA rather than TPU.

from torch import nn
from torchgpipe import GPipe

# a, b, c, and d are the layers (nn.Module instances) of the model.
model = nn.Sequential(a, b, c, d)
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)
output = model(input)

What is GPipe?

GPipe is a scalable pipeline parallelism library published by Google Brain, which allows efficient training of large, memory-consuming models. According to the paper, GPipe can train a 25x larger model by using 8x devices (TPU), and train a model 3.5x faster by using 4x devices.

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Google trained AmoebaNet-B with 557M parameters using GPipe. This model achieved 84.3% top-1 and 97.0% top-5 accuracy on the ImageNet classification benchmark (the state-of-the-art performance as of May 2019).

GPipe uses (a) pipeline parallelism and (b) automatic recomputation of the forward propagation during backpropagation, which together make it possible to train a large model. We refer to (b) as checkpointing, following the well-known terminology in the PyTorch community.

Pipeline Parallelism
GPipe splits a model into multiple partitions and places each partition on a different device so that the model can use more total memory. It also splits a mini-batch into multiple micro-batches so that the partitions work in parallel as much as possible.
Checkpointing
Checkpointing is applied to each partition to minimize the overall memory consumption of a model. During forward propagation, only the tensors at the boundaries between partitions are kept. All other intermediate tensors are discarded and recomputed during backpropagation when necessary.
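The way micro-batches keep several partitions busy at once can be sketched in plain Python. This is a simplified, idealized model of the forward pass only: it assumes every partition takes exactly one clock tick per micro-batch, and the function name pipeline_schedule is hypothetical, not part of torchgpipe.

```python
from typing import List, Tuple


def pipeline_schedule(num_microbatches: int, num_partitions: int) -> List[List[Tuple[int, int]]]:
    """Return, for each clock tick, the (micro-batch, partition) pairs that can run together.

    Assumption: micro-batch i reaches partition j at tick i + j, because it
    must first pass through partitions 0..j-1 at one tick each.
    """
    total_ticks = num_microbatches + num_partitions - 1
    schedule = []
    for tick in range(total_ticks):
        tasks = [
            (tick - j, j)  # (micro-batch index, partition index)
            for j in range(num_partitions)
            if 0 <= tick - j < num_microbatches
        ]
        schedule.append(tasks)
    return schedule
```

With a single micro-batch, every tick contains exactly one task, so only one device works at a time. With more micro-batches than partitions, the middle ticks contain one task per partition, which is where the parallelism comes from.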

Usage

Currently, torchgpipe requires the following environment:

Python 3.6+
PyTorch 1.1+

To use torchgpipe, install it via PyPI:

$ pip install torchgpipe

To train a module with GPipe, simply wrap it with torchgpipe.GPipe. Your module must be an nn.Sequential, because GPipe automatically splits the module into partitions of consecutive layers. The balance argument determines the number of layers in each partition. The chunks argument specifies the number of micro-batches. Input, output, and intermediate tensors must be Tensor or Tuple[Tensor, ...].

The example code below shows how to split a module with four layers into four partitions, each having a single layer. This code also splits a mini-batch into 8 micro-batches:

from torch import nn
from torchgpipe import GPipe

# Four layers, one per partition; each mini-batch is split into 8 micro-batches.
model = nn.Sequential(a, b, c, d)
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)

for input in data_loader:
    output = model(input)
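To make the roles of balance and chunks concrete, here is a minimal sketch in plain Python that works on ordinary lists instead of modules and tensors. The helper names split_by_balance and split_into_chunks are hypothetical and are not torchgpipe functions, and the requirement that the balance values sum to the number of layers is an assumption of this sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_by_balance(layers: Sequence[T], balance: Sequence[int]) -> List[List[T]]:
    """Group consecutive layers into partitions; balance[i] is the size of partition i."""
    if any(n <= 0 for n in balance):
        raise ValueError("every partition needs at least one layer")
    if sum(balance) != len(layers):
        raise ValueError("sum of balance must equal the number of layers")
    partitions, start = [], 0
    for n in balance:
        partitions.append(list(layers[start:start + n]))
        start += n
    return partitions


def split_into_chunks(batch: Sequence[T], chunks: int) -> List[List[T]]:
    """Split a mini-batch into at most `chunks` micro-batches of near-equal size."""
    if chunks <= 0:
        raise ValueError("chunks must be positive")
    size = max(1, -(-len(batch) // chunks))  # ceiling division
    return [list(batch[i:i + size]) for i in range(0, len(batch), size)]
```

For example, split_by_balance(["a", "b", "c", "d"], [3, 1]) would put the first three layers on one partition and the last layer on another, while split_into_chunks applied to a mini-batch of 32 samples with chunks=8 yields 8 micro-batches of 4 samples each.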

Documentation

Visit torchgpipe.readthedocs.io for more information including the API references.

Benchmarking

ResNet-101 Speed Benchmark

Experiment	torchgpipe	GPipe (original)
naive-1	1x	1x
pipeline-1	0.756x	0.8x
pipeline-2	1.489x	1.418x
pipeline-4	2.629x	2.182x
pipeline-8	4.367x	2.891x

The table shows the reproduced speed benchmark on ResNet-101, as reported in Figure 3(b) of the paper.

Naive-1 is the baseline setting, in which ResNet-101 is trained on a single device without GPipe. The speeds under the other settings are measured relative to the speed of naive-1, which is taken as the unit speed. Pipeline-k means k partitions with GPipe using k devices. Pipeline-1 is slower than naive-1 because it does not benefit from pipeline parallelism but still pays the checkpointing overhead.

The reproducible code can be found in examples/resnet101_speed_benchmark.
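One way to read the speed table is to divide each relative speed by the number of devices, which gives a rough parallel efficiency. The sketch below does this for the torchgpipe column only. The helper name parallel_efficiency is hypothetical, and the metric is an illustration rather than something reported in the paper.

```python
# Relative speeds copied from the torchgpipe column of the ResNet-101 table,
# keyed by the number of devices k in pipeline-k.
torchgpipe_speedup = {1: 0.756, 2: 1.489, 4: 2.629, 8: 4.367}


def parallel_efficiency(speedup: float, devices: int) -> float:
    """Speedup per device; 1.0 would mean perfect linear scaling."""
    return speedup / devices


efficiency = {k: parallel_efficiency(s, k) for k, s in torchgpipe_speedup.items()}

for k, value in efficiency.items():
    print(f"pipeline-{k}: {value:.3f}")
```

Under this reading, efficiency falls as devices are added, which is expected because the pipeline is not full at the start and end of each mini-batch.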

ResNet-101 Accuracy Benchmark

Batch size	torchgpipe	nn.DataParallel	Goyal et al.
256	21.99±0.13	22.02±0.11	22.08±0.06
1k	22.24±0.19	22.04±0.24	N/A
4k	22.13±0.09	N/A	N/A

The table shows the reproduced accuracy (top-1 error rate) benchmark on ResNet-101, as reported in Table 2(c) of the Accurate, Large Minibatch SGD paper.

The reproducible code can be found in examples/resnet101_accuracy_benchmark.

AmoebaNet-D Speed Benchmark

Experiment	torchgpipe	GPipe (original)
naive-2	1x	1x
pipeline-2	1.434x	1.156x
pipeline-4	2.049x	2.483x
pipeline-8	2.424x	3.442x

The table shows the reproduced speed benchmark on AmoebaNet-D, as reported in Figure 3(a) of the paper. The torchgpipe and GPipe results differ, however. We believe that this gap comes not from differences between torchgpipe and GPipe themselves but from reimplementing the TensorFlow AmoebaNet-D model in PyTorch. Results will be updated whenever a stable and reproducible AmoebaNet-D in PyTorch is available.

Naive-2 is the baseline setting, in which AmoebaNet-D is trained on two devices without GPipe. Pipeline-2 is a little faster than in the paper, but pipeline-4 and pipeline-8 are slower.

AmoebaNet-D Memory Benchmark

Experiment	naive-1		pipeline-1		pipeline-2		pipeline-4		pipeline-8
Experiment	torchgpipe	GPipe (original)	torchgpipe	GPipe (original)	torchgpipe	GPipe (original)	torchgpipe	GPipe (original)	torchgpipe	GPipe (original)
AmoebaNet-D (L, F)	(6, 208)		(6, 416)		(6, 544)		(12, 544)		(24, 512)
# of Model Parameters	90M	82M	358M	318M	613M	542M	1.16B	1.05B	2.01B	1.80B
Total Peak Model Parameter Memory	1.00GB	1.05GB	4.01GB	3.80GB	6.45GB	6.45GB	13.00GB	12.53GB	22.42GB	24.62GB
Total Peak Activation Memory	-	6.26GB	6.64GB	3.46GB	11.31GB	8.11GB	18.72GB	15.21GB	35.78GB	26.24GB

The table shows the improved memory utilization of AmoebaNet-D with GPipe, as reported in Table 1 of the paper. The size of an AmoebaNet-D model is determined by two hyperparameters, L and F, which are proportional to the number of layers and filters, respectively.

The difference between naive-1 and pipeline-1 shows that GPipe makes it possible to train a larger model even on a single device. With 8 GPUs, GPipe can train a model that is 22 times larger than in the naive-1 setting.
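The 22x figure can be checked directly from the parameter counts in the torchgpipe columns of the table. The short sketch below computes each setting's model size relative to naive-1; the variable names are illustrative only.

```python
# Parameter counts copied from the torchgpipe columns of the memory table.
params = {
    "naive-1": 90e6,
    "pipeline-1": 358e6,
    "pipeline-2": 613e6,
    "pipeline-4": 1.16e9,
    "pipeline-8": 2.01e9,
}

baseline = params["naive-1"]

# Model size of each setting relative to the naive-1 baseline.
scale = {name: count / baseline for name, count in params.items()}

for name, factor in scale.items():
    print(f"{name}: {factor:.1f}x")
```

The pipeline-8 entry divides 2.01B by 90M, which rounds to the 22x quoted above.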

Notes

This project is functional, but the interface is not finalized yet. All public APIs are subject to change without warning until v0.1.0.

Authors and Licensing

The torchgpipe project is developed by Heungsub Lee, Myungryong Jeong, and Chiheon Kim at Kakao Brain, with help from Sungbin Lim, Ildoo Kim, and Woonhyuk Baek. It is distributed under the Apache License 2.0.

Citation

If you apply this library to any project or research, please cite our code:

@misc{torchgpipe,
  author       = {Kakao Brain},
  title        = {torchgpipe, {A} {GPipe} implementation in {PyTorch}},
  howpublished = {\url{https://github.com/kakaobrain/torchgpipe}},
  year         = {2019}
}
