gpu-rest-engine
This repository shows how to implement a REST server for low-latency image classification (inference) using NVIDIA GPUs. It is an initial demonstration of the GRE (GPU REST Engine) software, which will allow you to build your own accelerated microservices.
This repository is a demo: it is not intended to be a generic solution that can accept any trained model, and code customization will be required for your use cases.
This demonstration makes use of several technologies with which you may be familiar:
- Docker: for bundling all the dependencies of our program and for easier deployment.
- Go: for its efficient built-in HTTP server.
- Caffe: for its good performance and simple C++ API.
- TensorRT: NVIDIA's high-performance inference engine.
- cuDNN: for accelerating common deep learning primitives on the GPU.
- OpenCV: for its simple C++ API for GPU image processing.
You will also need the following prerequisites:
- A Kepler or Maxwell NVIDIA GPU with at least 2 GB of memory.
- A Linux system with recent NVIDIA drivers (recommended: 352.79).
- The latest version of Docker.
- nvidia-docker.
Building the Caffe inference server image might take a while:
$ docker build -t inference_server -f Dockerfile.caffe_server .
To speed up the build, you can modify this line to build only for the GPU architecture that you need.
Building the TensorRT server instead requires the TensorRT archive to be present in the current folder:
$ docker build -t inference_server -f Dockerfile.tensorrt_server .
Execute the following command and wait a few seconds for the initialization of the classifiers:
$ docker run --runtime=nvidia --name=server --net=host --rm inference_server
You can use the environment variable NVIDIA_VISIBLE_DEVICES to isolate GPUs for this container. Since we used --net=host, we can access our inference server from a terminal on the host using curl:
$ curl -XPOST --data-binary @images/1.jpg http://127.0.0.1:8000/api/classify
[{"confidence":0.9998,"label":"n02328150 Angora, Angora rabbit"},{"confidence":0.0001,"label":"n02325366 wood rabbit, cottontail, cottontail rabbit"},{"confidence":0.0001,"label":"n02326432 hare"},{"confidence":0.0000,"label":"n02085936 Maltese dog, Maltese terrier, Maltese"},{"confidence":0.0000,"label":"n02342885 hamster"}]
We can benchmark the performance of our classification server using any tool that can generate HTTP load. We included a Dockerfile for a benchmarking client using rakyll/hey:
$ docker build -t inference_client -f Dockerfile.inference_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=20000 --net=host inference_client
If you have Go installed on your host, you can also benchmark the server with a client outside of a Docker container:
$ go get github.com/rakyll/hey
$ hey -n 200000 -m POST -D images/2.jpg http://127.0.0.1:8000/api/classify
Example results on a machine with 4 GeForce GTX Titan X GPUs:
$ hey -c 8 -n 200000 -m POST -D images/2.jpg http://127.0.0.1:8000/api/classify
Summary:
  Total:        100.7775 secs
  Slowest:      0.0167 secs
  Fastest:      0.0028 secs
  Average:      0.0040 secs
  Requests/sec: 1984.5690
  Total data:   68800000 bytes
  Size/request: 344 bytes
[...]
As a comparison, Caffe in standalone mode achieves approximately 500 images/second on a single Titan X for inference (batch=1).
This shows that our code achieves optimal GPU utilization and good
multi-GPU scaling, even when adding a REST API on top. A discussion of
GPU performance for inference at different batch sizes can be found in
our GPU-Based Deep Learning Inference whitepaper.
This inference server is aimed at low-latency applications; to achieve higher throughput, we would need to batch multiple incoming client requests, or have clients send multiple images to classify. Batching can be added easily when using the C++ API of Caffe. An example of this strategy, called "Batch Dispatch", can be found in this article from Baidu Research.
Similarly to the inference server, simple server code is provided for estimating the overhead of using CUDA kernels in your code. This server simply launches an empty CUDA kernel before responding with HTTP 200 to the client. It can be built using the same commands as above:
$ docker build -t benchmark_server -f Dockerfile.benchmark_server .
$ docker run --runtime=nvidia --name=server --net=host --rm benchmark_server
And for the client:
$ docker build -t benchmark_client -f Dockerfile.benchmark_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=200000 --net=host benchmark_client
[...]
Summary:
  Total:        5.8071 secs
  Slowest:      0.0127 secs
  Fastest:      0.0001 secs
  Average:      0.0002 secs
  Requests/sec: 34440.3083
Feel free to report issues during build or execution. We also welcome suggestions to improve the performance of this application.