Posted by the TensorFlow team
We are very excited to add post-training float16 quantization as part of the Model Optimization Toolkit. It is a suite of tools that includes hybrid quantization, full integer quantization, and pruning. Check out what else is on the roadmap.
Post-training float16 quantization reduces TensorFlow Lite model sizes by up to 50% while sacrificing very little accuracy. It quantizes model constants (like weights and bias values) from full-precision floating point (32-bit) to a reduced-precision floating point data type (IEEE FP16).
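To make the size and precision trade-off concrete, here is a minimal sketch using only Python's standard `struct` module, whose `e` format code packs IEEE 754 half-precision values. The weight values and helper names are hypothetical and chosen purely for illustration; this is not how the TensorFlow Lite converter is implemented.

```python
import struct

def to_fp32_bytes(values):
    """Pack floats as little-endian IEEE 754 single precision (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)

def to_fp16_bytes(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)

def from_fp16_bytes(blob):
    """Decode half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))

# Hypothetical weights, not taken from any real model.
weights = [0.1, -0.25, 0.333, 1.5, -2.75, 0.007]

fp32_blob = to_fp32_bytes(weights)
fp16_blob = to_fp16_bytes(weights)
restored = from_fp16_bytes(fp16_blob)

# Worst-case relative rounding error introduced by storing in fp16.
max_rel_err = max(abs(r - w) / abs(w) for w, r in zip(weights, restored))

print(len(fp32_blob), len(fp16_blob))  # 24 12
print(max_rel_err)
```

The byte count halves exactly, and for values in the normal fp16 range the relative rounding error stays within 2^-11 (about 0.05%), which is one way to see why accuracy is usually affected so little.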
Post-training float16 quantization is a good place to start quantizing your TensorFlow Lite models because of its minimal impact on accuracy and significant decrease in model size. You can check out our documentation here (including a new float chart!) to help walk you through the different quantization options and scenarios.
Benefits of reduced precision
There are multiple benefits to reduced precision, especially when deploying to the edge:
- 2x reduction in model size. All constant values in the model are stored in 16-bit floats instead of 32-bit floats. Since these constant values typically dominate the overall model size, this usually reduces the size of the model by about half.
- Negligible accuracy loss. Deep learning models can frequently produce good inference results with fewer bits of precision than they were originally trained with. In our experiments across several models, we found little loss in inference quality. For example, we measured a <0.03% reduction in Top 1 accuracy for MobileNet V2 (see the results below).
2x reduction in size, negligible accuracy tradeoff
Post-training float16 quantization has minimal impact on accuracy and results in a ~2x reduction in size for deep learning models. For example, here are some results for MobileNet V1 and V2 models and a MobileNet SSD model. The accuracy results for MobileNet V1 and V2 are based on the ImageNet image recognition task. The SSD model was evaluated on the COCO object detection task.
Model accuracy
The standard MobileNet float32 models (and fp16 variants) were evaluated on the ILSVRC 2012 image classification task. The MobileNet SSD float32 model and its fp16 variant were evaluated on the COCO Object Detection task.
How to enable post-training float16 quantization
You can specify post-training float16 quantization on the TensorFlow Lite converter by taking your trained float32 model, setting the optimization to DEFAULT, and the supported types of the target spec to the float16 constant:
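import tensorflow as tf

# saved_model_dir is the path to your trained float32 SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Enable the default optimizations.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Store constants as float16. tf.float16 is the dtype that
# tf.lite.constants.FLOAT16 refers to in TensorFlow 1.x.
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()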
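The converter returns the serialized model as a `bytes` object, so a natural next step is to write it to disk and compare its size against the float32 version. The sketch below assumes nothing beyond the standard library. The helper `save_model`, the file names, and the placeholder byte strings are all hypothetical: the placeholders stand in for real converter output and do not represent measured results.

```python
import tempfile
from pathlib import Path

def save_model(model_bytes, path):
    """Write serialized model bytes to disk and return the file size in bytes."""
    path = Path(path)
    path.write_bytes(model_bytes)
    return path.stat().st_size

# Placeholder byte strings, NOT real models. In practice these would be
# the bytes returned by converter.convert() with and without float16.
fake_fp32_model = bytes(4000)
fake_fp16_model = bytes(2000)

with tempfile.TemporaryDirectory() as tmp:
    fp32_size = save_model(fake_fp32_model, Path(tmp) / "model_fp32.tflite")
    fp16_size = save_model(fake_fp16_model, Path(tmp) / "model_fp16.tflite")

# The ratio here only reflects the placeholder sizes chosen above.
ratio = fp16_size / fp32_size
print(f"fp16 file is {ratio:.0%} of the float32 file size")
```

With real models, the same comparison is a quick way to confirm that the float16 conversion actually took effect before you ship the file.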
Once your model is converted, you can run it directly, just like any other TensorFlow Lite model. By default, the model will run on the CPU by “upsampling” the 16-bit parameters to 32 bits and then performing operations in standard 32-bit floating point arithmetic. Over time, we expect to see more hardware support for accelerated fp16 calculations, allowing us to drop the upsample to float32 and compute directly in these half-precision values.
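The CPU path described above can be sketched conceptually with the standard `struct` module: stored half-precision constants are widened to single precision, and the arithmetic then runs on the widened values. The function names and numbers are hypothetical, and this is an illustration of the idea rather than the TensorFlow Lite runtime's implementation. Note also that Python's own arithmetic is 64-bit, so the dot product here only approximates 32-bit computation.

```python
import struct

def upcast_fp16_to_fp32(fp16_blob):
    """Decode half-precision constants and re-encode them as float32.

    Returns the values as read from fp16 and as read back from float32.
    """
    count = len(fp16_blob) // 2
    halves = struct.unpack(f"<{count}e", fp16_blob)
    fp32_blob = struct.pack(f"<{count}f", *halves)
    singles = struct.unpack(f"<{count}f", fp32_blob)
    return halves, singles

def dot(weights, inputs):
    """Plain dot product (computed in Python's 64-bit floats)."""
    return sum(w * x for w, x in zip(weights, inputs))

# Hypothetical stored weights, already quantized to fp16.
stored = struct.pack("<4e", 0.5, -1.25, 0.1, 3.0)
halves, singles = upcast_fp16_to_fp32(stored)

# Every fp16 value is exactly representable in float32,
# so widening loses no further information.
assert halves == singles

inputs = [1.0, 2.0, 3.0, 4.0]
result = dot(singles, inputs)
print(result)
```

The point of the sketch is that precision is lost only once, when the constants are stored as fp16. Widening them back to float32 at run time is exact.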
You can also run your model on the GPU. We’ve enhanced the TensorFlow Lite GPU delegate to take in the reduced precision parameters and run with them directly (instead of converting to float32 as is done on the CPU). In your app, you create the GPU delegate via the TfLiteGpuDelegateCreate function (documentation). When specifying the options for the delegate, be sure to set precision_loss_allowed to 1 to use float16 operations on the GPU:
// Prepare GPU delegate.
const TfLiteGpuDelegateOptions options = {
  .metadata = NULL,
  .compile_options = {
    .precision_loss_allowed = 1,  // FP16
    .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,
    .dynamic_batch_enabled = 0,   // Not fully functional yet
  },
};
For an overview of the GPU delegate, see our previous post. Check out a working example of using float16 quantization in this colab tutorial.
Acknowledgements
T.J. Alumbaugh, Andrei Kulik, Juhyun Lee, Jared Duke, Raziel Alvarez, Sachin Joglekar, Jian Li, Yunlu Li, Suharsh Sivakumar, Nupur Garg, Lawrence Chan, Andrew Selle.