TensorFlow Model Optimization Toolkit — float16 quantization halves model size
August 05, 2019
Posted by the TensorFlow team

We are very excited to add post-training float16 quantization to the Model Optimization Toolkit, a suite of tools that also includes hybrid quantization, full integer quantization, and pruning. Check out what else is on the roadmap.

Post-training float16 quantization reduces TensorFlow Lite model sizes (up to 50%), while sacrificing very little accuracy. It quantizes model constants (like weights and bias values) from full precision floating point (32-bit) to a reduced precision floating point data type (IEEE FP16).
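As a rough illustration of what the two data types can represent, the sketch below prints the bit width, largest finite value, and machine epsilon of each. It uses NumPy, which is an assumption of this example and not part of the converter workflow described in this post.

```python
import numpy as np

# Numeric properties of the two floating point formats.
fp32 = np.finfo(np.float32)
fp16 = np.finfo(np.float16)

for name, info in (("float32", fp32), ("float16", fp16)):
    # bits: storage width, max: largest finite value, eps: spacing just above 1.0
    print(f"{name}: {info.bits} bits, max={float(info.max):.6g}, eps={float(info.eps):.3g}")
```

The narrower range and coarser spacing of float16 are the price paid for the smaller storage footprint.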

Post-training float16 quantization is a good place to get started with quantizing your TensorFlow Lite models because of its minimal impact on accuracy and significant decrease in model size. You can check out our documentation here (including a new flow chart!) to help walk you through the different quantization options and scenarios.

Picture of mushrooms with the image size reduced by quantization

Benefits of reduced precision

There are multiple benefits to reduced precision, especially when deploying to the edge:
  • 2x reduction in model size. All constant values in the model are stored in 16-bit floats instead of 32-bit floats. Since these constant values typically dominate the overall model size, this usually reduces the size of the model by about half.
  • Negligible accuracy loss. Deep learning models are frequently able to produce good results on inference while using fewer bits of precision than they were originally trained with. In our experimentation across several models we found little loss in inference quality. For example, we measured a <0.03% reduction in Top 1 accuracy for MobileNet V2 (see results below).
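Both points can be sketched on a single weight tensor. The example below is a NumPy illustration under stated assumptions (random weights in [-1, 1], a hypothetical 256x256 layer), not a measurement of model accuracy: it casts float32 values to float16 and reports the storage ratio and the largest rounding error.

```python
import numpy as np

# Hypothetical weight tensor; real model weights will have a different distribution.
rng = np.random.default_rng(0)
weights_fp32 = rng.uniform(-1.0, 1.0, size=(256, 256)).astype(np.float32)

# Store the same values in half precision.
weights_fp16 = weights_fp32.astype(np.float16)

# Storage drops from 4 bytes to 2 bytes per value.
size_ratio = weights_fp32.nbytes / weights_fp16.nbytes

# Rounding error introduced by the narrower format.
max_abs_error = float(np.max(np.abs(weights_fp32 - weights_fp16.astype(np.float32))))

print(f"size ratio: {size_ratio:.1f}x, max abs error: {max_abs_error:.2e}")
```

For values of magnitude at most 1, the rounding error of float16 stays below about 2.5e-4 per value, which gives some intuition for why inference quality often changes little.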

2x reduction in size, negligible accuracy tradeoff

Post-training float16 quantization has minimal impact on accuracy and results in ~2x reduction in size for deep learning models. For example, here are some results for MobileNet V1 and V2 models and a MobileNet SSD model. The accuracy results for MobileNet V1 and V2 are based on the ImageNet image recognition task. The SSD model was evaluated on the COCO object detection task.

Model accuracy

The standard MobileNet float32 models (and fp16 variants) were evaluated on the ILSVRC 2012 image classification task. The MobileNet SSD float32 model and its fp16 variant were evaluated on the COCO object detection task.


How to enable post-training float16 quantization

You can specify post-training float16 quantization on the TensorFlow Lite converter by taking your trained float32 model, setting the optimization to DEFAULT, and the supported types of the target spec to the float16 constant:


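import tensorflow as tf

# saved_model_dir is the path to your trained float32 SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Store model constants (weights, biases) as float16.
# tf.float16 is the same dtype as tf.lite.constants.FLOAT16.
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()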
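To check the size reduction on your own model, one option is to write the serialized models to disk and compare file sizes. The sketch below assumes you have two byte strings, one from a plain float32 conversion and one from the float16 conversion; the helper name size_reduction, the file names, and the placeholder payloads are hypothetical and only stand in for real converter output.

```python
import tempfile
from pathlib import Path


def size_reduction(float32_model: bytes, float16_model: bytes, out_dir: Path) -> float:
    """Write both serialized models to out_dir and return the fp16/fp32 size ratio."""
    fp32_path = out_dir / "model_fp32.tflite"
    fp16_path = out_dir / "model_fp16.tflite"
    fp32_path.write_bytes(float32_model)
    fp16_path.write_bytes(float16_model)
    return fp16_path.stat().st_size / fp32_path.stat().st_size


# Placeholder payloads (NOT real models); substitute the bytes returned by
# converter.convert() with and without float16 quantization.
fake_fp32_model = bytes(4000)
fake_fp16_model = bytes(2000)

with tempfile.TemporaryDirectory() as tmp:
    ratio = size_reduction(fake_fp32_model, fake_fp16_model, Path(tmp))

print(f"fp16 / fp32 size ratio: {ratio:.2f}")
```

With real models the ratio will usually be somewhat above 0.5, since only the constants shrink and the rest of the file keeps its size.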


Once your model is converted, you can run it directly, just like any other TensorFlow Lite model. By default, the model will run on the CPU by “upsampling” the 16-bit parameters to 32 bits, and then performing operations in standard 32-bit floating point arithmetic. Over time, we expect to see more hardware support for accelerated fp16 calculations, allowing us to drop the upsample to float32 and compute directly in these half precision values.
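The CPU behavior described above can be pictured with a small NumPy sketch. This is a conceptual illustration only, not the TensorFlow Lite implementation: the layer shape and random values are assumptions, and a single matrix multiply stands in for a model.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.uniform(0.0, 1.0, size=(4, 64)).astype(np.float32)
weights_fp32 = rng.uniform(-1.0, 1.0, size=(64, 10)).astype(np.float32)

# What the quantized model stores: half precision constants.
stored_weights = weights_fp32.astype(np.float16)

# CPU path: convert the constants back to float32, then compute in float32.
upcast_weights = stored_weights.astype(np.float32)
outputs = inputs @ upcast_weights

# Compare against the result computed with the original float32 weights.
reference = inputs @ weights_fp32
max_diff = float(np.max(np.abs(outputs - reference)))

print(f"output dtype: {outputs.dtype}, max difference vs float32 weights: {max_diff:.2e}")
```

The arithmetic itself is unchanged float32 math; the only difference from the original model is the rounding that the constants went through when they were stored as float16.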

You can also run your model on the GPU. We’ve enhanced the TensorFlow Lite GPU delegate to take in the reduced precision parameters and run with them directly (instead of converting to float32 as is done on the CPU). In your app, you create the GPU delegate via the TfLiteGpuDelegateCreate function (documentation). When specifying the options for the delegate, be sure to set precision_loss_allowed to 1 to use float16 operations on the GPU:


// Prepare GPU delegate.
const TfLiteGpuDelegateOptions options = {
  .metadata = NULL,
  .compile_options = {
    .precision_loss_allowed = 1,  // FP16
    .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,
    .dynamic_batch_enabled = 0,   // Not fully functional yet
  },
};
// Create the delegate with these options.
TfLiteDelegate* delegate = TfLiteGpuDelegateCreate(&options);


For an overview of the GPU delegate, see our previous post. Check out a working example of using float16 quantization in this colab tutorial.

We encourage you to give this a try right away and give us your feedback. Share your use case directly or on Twitter as #TFLite and #PoweredByTF.

Acknowledgements
T.J. Alumbaugh, Andrei Kulik, Juhyun Lee, Jared Duke, Raziel Alvarez, Sachin Joglekar, Jian Li, Yunlu Li, Suharsh Sivakumar, Nupur Garg, Lawrence Chan, Andrew Selle.