Abstract
We propose precision gating (PG), an end-to-end trainable dual-precision quantization technique for deep neural networks. PG computes most features in low precision and only a small proportion of important features in higher precision. Precision gating is lightweight and applicable to many neural network architectures. Experimental results show that precision gating can greatly reduce the average bitwidth of computations in both CNNs and LSTMs with negligible accuracy loss. Compared to state-of-the-art counterparts, PG achieves the same or better accuracy with 2.4× less compute on ImageNet. Compared to 8-bit uniform quantization, PG obtains a 1.2% improvement in perplexity per word with a 2.8× reduction in computational cost for an LSTM on the Penn Tree Bank dataset. Precision gating has the potential to greatly reduce the execution costs of DNNs on both commodity and dedicated hardware accelerators. We implement the sampled dense-dense matrix multiplication kernel used by PG on CPU, which achieves up to 8.3× wall clock speedup over the dense baseline.
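To make the dual-precision idea concrete, the following is a minimal sketch rather than the implementation evaluated here. It assumes that each quantized activation is split into high-order and low-order bits, that every output is first computed from the high-order bits alone, and that only outputs exceeding a threshold receive the low-order contribution. The function name `precision_gated_matmul`, its parameters, the threshold-based gating rule, and the average-bitwidth formula are all illustrative assumptions.

```python
def precision_gated_matmul(x_q, w, total_bits=8, msb_bits=4, threshold=0.0):
    """Dual-precision product of quantized activations and weights (sketch).

    x_q: rows of unsigned integers below 2**total_bits.
    w:   weight matrix as a list of rows (inputs x outputs).
    Returns (out, mask); mask[i][j] is True where the high-precision
    update was applied. All names here are hypothetical.
    """
    lsb_bits = total_bits - msb_bits
    n_in, n_out = len(w), len(w[0])
    out, mask = [], []
    for row in x_q:
        # Split each activation into its high-order and low-order bits.
        msb = [(v >> lsb_bits) << lsb_bits for v in row]
        lsb = [v - m for v, m in zip(row, msb)]
        out_row, mask_row = [], []
        for j in range(n_out):
            # Phase 1: low-precision pass over the high-order bits only.
            y = sum(msb[k] * w[k][j] for k in range(n_in))
            # Gate (assumed rule): refine only outputs above the threshold.
            important = y > threshold
            if important:
                # Phase 2: add the low-order contribution for this output only,
                # which is where a sampled dense-dense product would be used.
                y += sum(lsb[k] * w[k][j] for k in range(n_in))
            out_row.append(y)
            mask_row.append(important)
        out.append(out_row)
        mask.append(mask_row)
    return out, mask


x = [[255, 16], [3, 1]]
w = [[1.0, -1.0], [1.0, 1.0]]
out, mask = precision_gated_matmul(x, w)

# Fraction of outputs that received the high-precision update.
gated = sum(map(sum, mask)) / (len(mask) * len(mask[0]))
# Assumed definition of average bitwidth: 4 bits always, 4 more when gated.
avg_bits = 4 + gated * 4  # 5.0 for this toy input
```

In this toy input only one of the four outputs passes the gate, so the average activation bitwidth under the assumed definition is 5 bits instead of 8. Lowering the threshold to negative infinity gates every output and recovers the full-precision product.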