Quantization and Training of Neural Networks for Efficient
Integer-Arithmetic-Only Inference
Abstract
The rising popularity of intelligent mobile devices and the daunting computational cost of deep-learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating-point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.
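To make the kind of scheme described above concrete, the following is a minimal sketch of affine (asymmetric) quantization, a standard mapping used for integer-only inference in which a real value r is approximated as S * (q - Z) for a floating-point scale S and an integer zero-point Z. The helper names, the per-tensor parameter choice, and the 8-bit range are illustrative assumptions, not details drawn from this abstract.

```python
import numpy as np

def choose_qparams(r_min, r_max, num_bits=8):
    """Pick scale S and zero-point Z mapping [r_min, r_max] onto [0, 2^bits - 1].

    Sketch only: the degenerate case r_min == r_max == 0 is not handled.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must contain 0 so that zero is exactly quantizable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (qmax - qmin)
    zero_point = int(np.clip(round(qmin - r_min / scale), qmin, qmax))
    return scale, zero_point

def quantize(r, scale, zero_point, num_bits=8):
    # q = round(r / S) + Z, clamped to the integer range.
    q = np.round(r / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Reconstruct the real value: r ~= S * (q - Z).
    return scale * (q.astype(np.int32) - zero_point)

# Usage: quantize a toy activation tensor and check the reconstruction error,
# which is bounded by roughly scale / 2 when the range covers the data.
r = np.random.uniform(-1.0, 2.0, size=(4, 4)).astype(np.float32)
s, z = choose_qparams(r.min(), r.max())
q = quantize(r, s, z)
print(np.abs(r - dequantize(q, s, z)).max())
```

In a scheme of this form, only the final rescaling involves the floating-point scale; the bulk of inference (e.g., accumulating products of the integer values q - Z) can run in integer arithmetic, which is what makes it attractive for the integer-only hardware mentioned above.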