Abstract
Quantization is considered one of the most effective methods of optimizing the inference cost of neural network models for deployment to mobile and embedded systems, which have tight resource constraints. In such approaches, it is critical to provide low-cost quantization under a tight accuracy-loss constraint (e.g., 1%). In this paper, we propose a novel method for quantizing weights and activations based on the concept of weighted entropy. Unlike recent work on binary-weight neural networks, our approach is a multi-bit quantization method, in which weights and activations can be quantized with any number of bits depending on the target accuracy. This enables much more flexible exploitation of the accuracy-performance trade-off offered by different levels of quantization. Moreover, our
scheme provides an automated quantization flow based on
conventional training algorithms, which greatly reduces the
design-time effort to quantize the network. According to
our extensive evaluations based on practical neural network
models for image classification (AlexNet, GoogLeNet, and
ResNet-50/101), object detection (R-FCN with ResNet-50),
and language modeling (an LSTM network), our method
achieves significant reductions in both the model size and
the amount of computation with minimal accuracy loss.
Also, compared to existing quantization schemes, ours achieves higher accuracy under a similar resource constraint while requiring much less design effort.
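To make the notion of weighted entropy concrete, the Python sketch below scores one candidate clustering of weight magnitudes: each weight contributes an importance, here taken as its squared magnitude (an illustrative assumption), rather than a unit count, so the entropy balances clusters by importance instead of population. The function and variable names are ours, not the paper's, and the search for the best cluster boundaries used in the actual quantization flow is omitted.

```python
import numpy as np

def weighted_entropy(weights, boundaries):
    # Weighted entropy of the clustering of |weights| induced by the
    # given interior bin boundaries. Importance i(w) = w**2 is an
    # illustrative choice, not necessarily the paper's exact measure.
    w = np.abs(np.asarray(weights, dtype=np.float64).ravel())
    importance = w ** 2
    clusters = np.digitize(w, boundaries)        # cluster index per weight
    total = importance.sum()
    entropy = 0.0
    for n in range(len(boundaries) + 1):
        s_n = importance[clusters == n].sum()    # importance mass of cluster n
        if s_n > 0.0:
            p_n = s_n / total                    # relative importance of cluster n
            entropy -= p_n * np.log(p_n)
    return entropy

# Example: score one candidate set of boundaries for 2-bit weights
# (four clusters over |w|); a real flow would search for the
# boundaries that maximize this value.
w = np.random.randn(10000) * 0.05
print(weighted_entropy(w, boundaries=[0.02, 0.05, 0.10]))
```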