
torchgpipe




GPipe implementation in PyTorch. It is optimized for CUDA rather than TPU.

import torch.nn as nn
from torchgpipe import GPipe

model = nn.Sequential(a, b, c, d)  # a, b, c, d: your consecutive layers
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)
output = model(input)

What is GPipe?

GPipe is a scalable pipeline parallelism library published by Google Brain, which allows efficient training of large, memory-consuming models. According to the paper, GPipe can train a 25x larger model by using 8x devices (TPU), and train a model 3.5x faster by using 4x devices.

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Google trained AmoebaNet-B with 557M parameters over GPipe. This model has achieved 84.3% top-1 and 97.0% top-5 accuracy on ImageNet classification benchmark (the state-of-the-art performance as of May 2019).

GPipe uses (a) pipeline parallelism and (b) automatic recomputation of the forward propagation during the backpropagation, thereby enabling the training of much larger models. We refer to (b) as checkpointing, following the well-known terminology in the PyTorch community.

  • Pipeline Parallelism

  • GPipe splits a model into multiple partitions and places each partition on a different device to utilize more memory capacity. It also splits a mini-batch into multiple micro-batches so that the partitions can work in parallel as much as possible.

  • Checkpointing

  • Checkpointing is applied to each partition to minimize the overall memory consumption of the model. During forward propagation, only the tensors at the boundaries between partitions are kept. All other intermediate tensors are discarded and recomputed during backpropagation when necessary, as shown in the sketch after this list.
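
To make both ideas concrete, here is a minimal sketch in plain PyTorch rather than torchgpipe's internals: a mini-batch is split into micro-batches with Tensor.chunk, and the first stage is wrapped in torch.utils.checkpoint so its intermediate activations are recomputed during backward instead of being stored. The two stages and all sizes are illustrative placeholders, and this loop runs the stages sequentially on one device; real GPipe overlaps micro-batches across devices.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Two hypothetical "partitions" of a sequential model.
stage1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
stage2 = nn.Linear(64, 10)

def pipelined_forward(x, chunks=4):
    outputs = []
    for micro_batch in x.chunk(chunks):   # (a) split the mini-batch
        # (b) keep only the boundary tensor between stages; stage1's
        # intermediate activations are recomputed during backpropagation.
        boundary = checkpoint(stage1, micro_batch)
        outputs.append(stage2(boundary))
    return torch.cat(outputs)

x = torch.randn(16, 32, requires_grad=True)
pipelined_forward(x).sum().backward()     # triggers the recomputation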

Usage

Currently, torchgpipe requires the following environments:

  • Python 3.6+

  • PyTorch 1.1+

To use torchgpipe, install it via PyPI:

$ pip install torchgpipe

To train a module with GPipe, simply wrap it with torchgpipe.GPipe. Your module must be an nn.Sequential, as GPipe automatically splits the module into partitions of consecutive layers. The balance argument determines the number of layers in each partition, and the chunks argument specifies the number of micro-batches. Input, output, and intermediate tensors must be Tensor or Tuple[Tensor, ...].

The example below shows how to split a module with four layers into four partitions, each having a single layer. It also splits each mini-batch into 8 micro-batches:

from torchgpipe import GPipe

model = nn.Sequential(a, b, c, d)
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)

for input in data_loader:
    output = model(input)
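
Beyond the forward pass, training proceeds as with any nn.Module. The sketch below assumes a classification loss, an SGD optimizer, and a data_loader yielding (input, target) pairs; none of these choices are part of torchgpipe's API.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchgpipe import GPipe

model = nn.Sequential(a, b, c, d)      # a, b, c, d as above
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for input, target in data_loader:      # assumed (input, target) pairs
    output = model(input)              # mini-batch runs as 8 micro-batches
    loss = F.cross_entropy(output, target)
    optimizer.zero_grad()
    loss.backward()                    # gradients flow back through all partitions
    optimizer.step()

In a multi-GPU setup, note that the input is expected on the device of the first partition and the output is produced on the device of the last; see the documentation for details.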

Documentation

Visit torchgpipe.readthedocs.io for more information including the API references.

Benchmarking

ResNet-101 Speed Benchmark

| Experiment | torchgpipe | GPipe (original) |
|------------|-----------:|-----------------:|
| naive-1    | 1x         | 1x               |
| pipeline-1 | 0.756x     | 0.8x             |
| pipeline-2 | 1.489x     | 1.418x           |
| pipeline-4 | 2.629x     | 2.182x           |
| pipeline-8 | 4.367x     | 2.891x           |

The table shows the reproduced speed benchmark on ResNet-101, as reported in Figure 3(b) of the paper.

Naive-1 indicates the baseline setting where ResNet-101 is trained on a single device without GPipe. The speeds under the other settings are measured relative to the speed of naive-1, which is taken as the unit speed. Pipeline-k means k partitions with GPipe on k devices. Pipeline-1 is slower than naive-1 since it does not benefit from pipeline parallelism but still pays the checkpointing overhead.

The reproducible code can be found in examples/resnet101_speed_benchmark.

ResNet-101 Accuracy Benchmark

| Batch size | torchgpipe | nn.DataParallel | Goyal et al. |
|------------|-----------:|----------------:|-------------:|
| 256        | 21.99±0.13 | 22.02±0.11      | 22.08±0.06   |
| 1k         | 22.24±0.19 | 22.04±0.24      | N/A          |
| 4k         | 22.13±0.09 | N/A             | N/A          |

The table shows the reproduced accuracy (top-1 error rate) benchmark on ResNet-101, as reported in Table 2(c) of the Accurate, Large Minibatch SGD paper.

The reproducible code can be found in examples/resnet101_accuracy_benchmark.

AmoebaNet-D Speed Benchmark

| Experiment | torchgpipe | GPipe (original) |
|------------|-----------:|-----------------:|
| naive-2    | 1x         | 1x               |
| pipeline-2 | 1.434x     | 1.156x           |
| pipeline-4 | 2.049x     | 2.483x           |
| pipeline-8 | 2.424x     | 3.442x           |

The table shows the reproduced speed benchmark on AmoebaNet-D, as reported in Figure 3(a) of the paper. There is some gap between torchgpipe and GPipe here; we believe it is caused not by the libraries themselves, but by our reimplementation in PyTorch of the AmoebaNet-D model, which was originally written in TensorFlow. Results will be updated whenever a stable and reproducible AmoebaNet-D in PyTorch becomes available.

Naive-2 indicates the baseline setting where AmoebaNet-D is trained on two devices without GPipe. Pipeline-2 is a little faster than in the paper, but pipeline-4 and pipeline-8 are slower.

AmoebaNet-D Memory Benchmark

In the table below, each paired cell reports torchgpipe / GPipe (original).

| Experiment | AmoebaNet-D (L, F) | # of Model Parameters | Total Peak Model Parameter Memory | Total Peak Activation Memory |
|------------|--------------------|-----------------------|-----------------------------------|------------------------------|
| naive-1    | (6, 208)           | 90M / 82M             | 1.00GB / 1.05GB                   | - / 6.26GB                   |
| pipeline-1 | (6, 416)           | 358M / 318M           | 4.01GB / 3.80GB                   | 6.64GB / 3.46GB              |
| pipeline-2 | (6, 544)           | 613M / 542M           | 6.45GB / 6.45GB                   | 11.31GB / 8.11GB             |
| pipeline-4 | (12, 544)          | 1.16B / 1.05B         | 13.00GB / 12.53GB                 | 18.72GB / 15.21GB            |
| pipeline-8 | (24, 512)          | 2.01B / 1.80B         | 22.42GB / 24.62GB                 | 35.78GB / 26.24GB            |

The table shows the better memory utilization of AmoebaNet-D with GPipe, as reported in Table 1 of the paper. The size of an AmoebaNet-D model is determined by two hyperparameters, L and F, which are proportional to the number of layers and the number of filters, respectively.

The difference between naive-1 and pipeline-1 indicates GPipe's capability to enable training of a larger model. With 8 GPUs, GPipe is capable of training a model roughly 22 times larger (2.01B vs. 90M parameters) than in the naive-1 setting.

Notes

This project is functional, but the interface is not yet finalized. All public APIs are subject to change without warning until v0.1.0.

Authors and Licensing

The torchgpipe project is developed by Heungsub Lee, Myungryong Jeong, and Chiheon Kim at Kakao Brain, with the help of Sungbin Lim, Ildoo Kim, and Woonhyuk Baek. It is distributed under the Apache License 2.0.

Citation

If you apply this library to any project or research, please cite our code:

@misc{torchgpipe,
  author       = {Kakao Brain},
  title        = {torchgpipe, {A} {GPipe} implementation in {PyTorch}},
  howpublished = {\url{https://github.com/kakaobrain/torchgpipe}},
  year         = {2019}
}

