September 09, 2021 — Posted by Marat Dukhan and Frank Barchard, software engineers Quantization is among the most popular methods to speedup neural network inference on CPUs. A year ago TensorFlow Lite increased performance for floating-point models with the integration of XNNPACK backend. Today, we are extending the XNNPACK backend to quantized models with, on average across computer vision models, 30% speedup on AR…
Posted by Marat Dukhan and Frank Barchard, software engineers
Quantization is among the most popular methods to speedup neural network inference on CPUs. A year ago TensorFlow Lite increased performance for floating-point models with the integration of XNNPACK backend. Today, we are extending the XNNPACK backend to quantized models with, on average across computer vision models, 30% speedup on ARM64 mobile phones, 5X speedup on x86-64 laptop and desktop systems, and 20X speedup for in-browser inference with WebAssembly SIMD compared to the default TensorFlow Lite quantized kernels.
Quantized inference in XNNPACK is optimized for symmetric quantization schemas used by the TensorFlow Model Optimization Toolkit. XNNPACK supports both the traditional per-tensor quantization schema and the newer accuracy-optimized schema with per-channel quantization of weights and per-tensor quantization of activations. Additionally, XNNPACK supports the asymmetric quantization schema, albeit with reduced efficiency.
We evaluated XNNPACK-acclerated quantized inference on a number of edge devices and neural network architectures. Below, we present benchmarks on four public and two internal quantized models covering common computer vision tasks:
Speedup from XNNPACK on single-threaded inference of quantized computer vision models on Android/ARM64 mobile phones. |
Across the six Android ARM64 mobile devices XNNPACK delivers, on average, 30% speedup over the default TensorFlow Lite quantized kernels.
Speedup from XNNPACK on single-threaded inference of quantized computer vision models on x86-64 laptop and desktop systems. |
XNNPACK offers even greater improvements on laptop and desktop systems with x86 processors. On the 5 x86 processors in our benchmarks XNNPACK accelerated inference on average by 5 times. Notably, low-end and older processors which don’t support AVX instructions see over 20X speedup from switching quantized inference to XNNPACK: while the previous TensorFlow Lite inference backend had optimized implementations only for AVX, AVX2, and AVX512 instruction sets, XNNPACK provides optimized implementations for all x86-64 processors.
Speedup from XNNPACK on single-threaded WebAssembly SIMD inference of quantized computer vision models on mobile phones, laptops, and desktops when running through V8. |
Besides the traditional mobile and laptop/desktop platforms, XNNPACK brings accelerated quantized inference to the Web platform through the TensorFlow Lite Web API. The above plot demonstrates a geomean speedup of 20X over the default TensorFlow Lite implementation when running WebAssembly SIMD benchmarks through the V8 JavaScript engine on 3 x86-64 and 2 ARM64 systems.
XNNPACK started its life as a fork of QNNPACK library, but as the first version of XNNPACK focused on floating-point inference and QNNPACK focused on quantized inference, it was not possible to compare the two. Now with XNNPACK introducing support for quantized inference, we can directly evaluate and attribute the two further years of performance optimizations.
To compare the two quantized inference backends, we ported randomized MobileNet v1 and MobileNet v2 models from XNNPACK API to QNNPACK API, and benchmarked their single-threaded performance on two ARM64 Android phones and two x86-64 systems. The results are presented in the plot above, and the progress made by XNNPACK in two years is striking. XNNPACK is 50% faster on the older Pixel 3a phone and 4-5X faster on the newer Pixel 4a phone, 2.5X faster on the x86-64 laptop, and over 3X faster on the x86-64 workstation. These improvements are the result of a multiple optimizations XNNPACK gained in the two years since it forked from QNNPACK:
[-127, 127]
range (-128
is excluded even though it can be represented as a signed 8-bit integer). Symmetric quantization offers two computational advantages exploited in XNNPACK. First, when the filter weights are static, the results of accumulating the product of input zero point by the filter weights can be completely fused into the bias term in the Convolution, Deconvolution, and Fully Connected operators. Thus, zero point parameters are completely absent from the inference computations. Secondly, the product of a signed 8-bit input element by the weight element restricted to [-127, 127]
fits into 15 bits. This enables the microkernels for Convolution, Deconvolution, and Fully Connected operators to do half of the accumulations on 16-bit variables rather than always extending the products to 32 bits.Taken together, these optimizations make XNNPACK the new state-of-art for quantized inference, and turn TensorFlow Lite into the most versatile quantized inference solution, covering systems from Raspberry Pi Zero to Chromebooks to workstations with server-class processors.
Quantized XNNPACK inference is enabled by default in the CMake builds of TensorFlow Lite for all platforms, in the Bazel builds of TensorFlow Lite for the Web platform, and will be available in TensorFlow Lite Web API in the 2.7 release. In Bazel builds for other platforms, quantized XNNPACK inference is enabled via a build-time opt-in mechanism. When building TensorFlow Lite with Bazel, add --define tflite_with_xnnpack=true --define xnn_enable_qs8=true
, and the TensorFlow Lite interpreter will use the XNNPACK backend by default for supported operators with symmetric quantization. Limited support for operators with asymmetric quantization is available via the --define xnn_enable_qu8=true
Bazel option.
The XNNPACK backend currently supports a subset of quantized TensorFlow Lite operators (see documentation for details and limitations). XNNPACK supports models produced by the Model Optimization Toolkit through post-training integer quantization and quantization-aware training, but not post-training dynamic range quantization.
This is the third version of the XNNPACK integration into TensorFlow Lite following the initial release of the floating-point implementation and the subsequent release that brought sparse inference support. In the following versions we plan to add the following improvements:
We encourage you to leave your thoughts and comments on our GitHub and StackOverflow pages, and you can ask questions on discuss.tensorflow.org
September 09, 2021 — Posted by Marat Dukhan and Frank Barchard, software engineers Quantization is among the most popular methods to speedup neural network inference on CPUs. A year ago TensorFlow Lite increased performance for floating-point models with the integration of XNNPACK backend. Today, we are extending the XNNPACK backend to quantized models with, on average across computer vision models, 30% speedup on AR…