gpu-rest-engine
This repository shows how to implement a REST server for low-latency image classification (inference) using NVIDIA GPUs. It is an initial demonstration of the GRE (GPU REST Engine) software, which will allow you to build your own accelerated microservices.
This repository is a demo: it is not intended to be a generic solution that can accept any trained model, and code customization will be required for your use cases.
This demonstration makes use of several technologies with which you may be familiar:
- Docker: for bundling all the dependencies of our program and for easier deployment.
- Go: for its efficient built-in HTTP server.
- Caffe: for its good performance and simple C++ API.
- TensorRT: NVIDIA's high-performance inference engine.
- cuDNN: for accelerating common deep learning primitives on the GPU.
- OpenCV: for its simple C++ API for GPU image processing.
You will also need the following prerequisites:
- A Kepler or Maxwell NVIDIA GPU with at least 2 GB of memory.
- A Linux system with recent NVIDIA drivers (recommended: 352.79).
- The latest version of Docker.
- nvidia-docker.
Building the Caffe inference server image might take a while:
$ docker build -t inference_server -f Dockerfile.caffe_server .
To speed up the build, you can modify this line to build only for the GPU architecture that you need.
Building the TensorRT server instead requires the TensorRT archive to be present in the current folder:
$ docker build -t inference_server -f Dockerfile.tensorrt_server .
Execute the following command and wait a few seconds for the initialization of the classifiers:
$ docker run --runtime=nvidia --name=server --net=host --rm inference_server
You can use the environment variable NVIDIA_VISIBLE_DEVICES to isolate GPUs for this container. Since we used --net=host, we can access our inference server from a terminal on the host using curl:
$ curl -XPOST --data-binary @images/1.jpg http://127.0.0.1:8000/api/classify
[{"confidence":0.9998,"label":"n02328150 Angora, Angora rabbit"},{"confidence":0.0001,"label":"n02325366 wood rabbit, cottontail, cottontail rabbit"},{"confidence":0.0001,"label":"n02326432 hare"},{"confidence":0.0000,"label":"n02085936 Maltese dog, Maltese terrier, Maltese"},{"confidence":0.0000,"label":"n02342885 hamster"}]
We can benchmark the performance of our classification server using any tool that can generate HTTP load. We included a Dockerfile for a benchmarking client using rakyll/hey:
$ docker build -t inference_client -f Dockerfile.inference_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=20000 --net=host inference_client
If you have Go installed on your host, you can also benchmark the server with a client outside of a Docker container:
$ go get github.com/rakyll/hey
$ hey -n 200000 -m POST -D images/2.jpg http://127.0.0.1:8000/api/classify
Example results on a machine with 4 GeForce GTX Titan X GPUs:
$ hey -c 8 -n 200000 -m POST -D images/2.jpg http://127.0.0.1:8000/api/classify
Summary:
  Total:        100.7775 secs
  Slowest:      0.0167 secs
  Fastest:      0.0028 secs
  Average:      0.0040 secs
  Requests/sec: 1984.5690
  Total data:   68800000 bytes
  Size/request: 344 bytes
[...]
As a comparison, Caffe in standalone mode achieves approximately 500 images/second on a single Titan X for inference (batch=1).
This shows that our code achieves optimal GPU utilization and good
multi-GPU scaling, even when adding a REST API on top. A discussion of
GPU performance for inference at different batch sizes can be found in
our GPU-Based Deep Learning Inference whitepaper.
This inference server is aimed at low-latency applications; to achieve higher throughput, we would need to batch multiple incoming client requests, or have clients send multiple images to classify. Batching can be added easily when using the C++ API of Caffe. An example of this strategy, called "Batch Dispatch", can be found in this article from Baidu Research.
Similarly to the inference server, simple server code is provided for estimating the overhead of using CUDA kernels in your code. This server simply launches an empty CUDA kernel before responding with HTTP 200 to the client. It can be built using the same commands as above:
$ docker build -t benchmark_server -f Dockerfile.benchmark_server .
$ docker run --runtime=nvidia --name=server --net=host --rm benchmark_server
And for the client:
$ docker build -t benchmark_client -f Dockerfile.benchmark_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=200000 --net=host benchmark_client
[...]
Summary:
  Total:        5.8071 secs
  Slowest:      0.0127 secs
  Fastest:      0.0001 secs
  Average:      0.0002 secs
  Requests/sec: 34440.3083
Feel free to report issues during build or execution. We also welcome suggestions to improve the performance of this application.