https://blog.tensorflow.org/2019/06/tensorflow-integer-quantization.html


June 11, 2019 —
*Posted by the TensorFlow Model Optimization Team*


TensorFlow Model Optimization Toolkit — Post-Training Integer Quantization

Since we introduced the Model Optimization Toolkit — a suite of techniques that both novice and advanced developers can use to optimize machine learning models for deployment and execution — we have been working hard to reduce the complexity of quantizing machine learning models.

Initially, we supported post-training quantization via “hybrid operations”, which quantizes the parameters of the model (i.e. weights) but allows certain parts of the computation to take place in floating point. Today, we are happy to announce the next addition to our tooling: post-training integer quantization. Integer quantization is a general technique that reduces the numerical precision of a model's weights and activations to reduce memory use and improve latency.

Quantize models to reduce size, latency, and power consumption with negligible accuracy loss

Our new post-training integer quantization enables users to take an already-trained floating-point model and fully quantize it to only use 8-bit signed integers (i.e. `int8`). By leveraging this quantization scheme, we can get reasonable quantized model accuracy across many models without resorting to retraining a model with quantization-aware training. With this new tool, models will continue to be 4x smaller, but will see even greater CPU speed-ups. Fixed-point hardware accelerators, such as Edge TPUs, will also be able to run these models.
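The 4x figure follows from storage width alone: a 32-bit float takes four bytes and an `int8` value takes one. The sketch below illustrates this with Python's standard `array` module. The layer size is hypothetical, and the small overhead of storing quantization parameters is ignored.

```python
from array import array

# Hypothetical layer with 1,000 weights (illustrative size only).
num_weights = 1000

float_weights = array("f", [0.0] * num_weights)  # 32-bit floats
int8_weights = array("b", [0] * num_weights)     # 8-bit signed integers

# Total storage in bytes for each representation.
float_bytes = float_weights.itemsize * len(float_weights)
int8_bytes = int8_weights.itemsize * len(int8_weights)

size_ratio = float_bytes / int8_bytes
print(f"float32: {float_bytes} bytes, int8: {int8_bytes} bytes, ratio: {size_ratio}x")
```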

Compared to quantization-aware training, this tool is much simpler to use, and offers comparable accuracy on most models. There may still be use cases where quantization-aware training is required, but we expect this to be rare as we continue to improve post-training tooling.

In summary, a user should use “hybrid” post-training quantization when targeting simple CPU size and latency improvements. When targeting greater CPU improvements or fixed-point accelerators, they should use this integer post-training quantization tool, falling back to quantization-aware training if the model's accuracy suffers.

```
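import tensorflow as tf
import tensorflow_datasets as tfds

# num_calibration_steps and saved_model_dir are supplied by the user.
def representative_dataset_gen():
    # Yield a small number of representative inputs for calibration.
    data = tfds.load(...)
    for _ in range(num_calibration_steps):
        image, = data.take(1)
        yield [image]

# Convert the SavedModel with default optimizations and calibration data.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = tf.lite.RepresentativeDataset(
    representative_dataset_gen)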
```
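How the converter uses the representative dataset internally is not spelled out here. As a rough mental model only, and assuming calibration amounts to tracking the minimum and maximum values observed while the representative inputs flow through the model, the idea can be sketched in plain Python. The function names and sample values below are hypothetical and are not part of the TensorFlow Lite API.

```python
def calibrate_range(representative_batches):
    """Track the running min and max over every value seen (assumed scheme)."""
    lo, hi = float("inf"), float("-inf")
    for batch in representative_batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def toy_representative_dataset():
    # Hypothetical activation values standing in for real model tensors.
    yield [0.1, 0.9, -0.3]
    yield [2.5, 0.0, -1.2]
    yield [1.1, 0.4, 0.7]


observed_min, observed_max = calibrate_range(toy_representative_dataset())
print(observed_min, observed_max)
```

A range gathered this way is what a quantizer would need in order to choose a scale and zero-point for a tensor, which is why the calibration data should resemble real inputs.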

To require that the converter output only integer operations, set the following flag:

`converter.target_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]`

When this flag is used and an operation has no integer quantizable counterpart, the TensorFlow Lite Converter will throw an error.

8-bit quantization approximates floating point values using the following formula:

`real_value = (sint8_value - zero_point) * scale`
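A minimal sketch of this mapping in plain Python follows. The `dequantize` function applies the formula directly, and `quantize` is its inverse with rounding and clamping to the `int8` range. The function names and the `scale` and `zero_point` values are hypothetical, chosen only for illustration.

```python
def quantize(real_value, scale, zero_point):
    """Map a real value to the nearest int8 code, clamped to [-128, 127]."""
    code = round(real_value / scale) + zero_point
    return max(-128, min(127, code))


def dequantize(sint8_value, scale, zero_point):
    """Apply real_value = (sint8_value - zero_point) * scale."""
    return (sint8_value - zero_point) * scale


# Hypothetical quantization parameters.
scale = 0.5
zero_point = -10

code = quantize(3.2, scale, zero_point)
approx = dequantize(code, scale, zero_point)
print(code, approx)
```

Here 3.2 is stored as the code -4 and recovered as 3.0, so the round trip is only an approximation. For values inside the representable range, the error is at most half of `scale`.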

Per-axis (also known as “per-channel”) or per-layer weights are represented by int8 two’s complement values in the range [-127, 127], with a zero-point equal to 0.

Per-layer activations and inputs are represented by int8 two’s complement values in the range [-128, 127], with a zero-point in the range [-128, 127].
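The two schemes above can be sketched in plain Python. This is an illustration under stated assumptions rather than the converter's actual procedure: the weight scale is taken from the largest absolute value in each channel, the activation scale from an observed min/max range, and all function names and numbers are hypothetical.

```python
def symmetric_weight_params(channel_weights):
    """One scale per channel so codes span [-127, 127]; zero-point is 0."""
    max_abs = max(abs(w) for w in channel_weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return scale, 0


def asymmetric_activation_params(min_val, max_val):
    """Scale and zero-point mapping [min_val, max_val] onto [-128, 127]."""
    # Widen the range to include 0.0 so that real zero maps to an exact code.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / 255 or 1.0
    zero_point = round(-128 - min_val / scale)
    return scale, max(-128, min(127, zero_point))


# Hypothetical weights with two output channels, quantized per channel.
weights = [[0.5, -1.27], [2.0, -0.5]]
q_weights = []
for channel in weights:
    w_scale, _ = symmetric_weight_params(channel)
    q_weights.append([max(-127, min(127, round(w / w_scale))) for w in channel])

# Hypothetical activation range for a whole layer.
act_scale, act_zero_point = asymmetric_activation_params(-1.0, 3.0)
print(q_weights, act_zero_point)
```

Each channel uses its own scale, so the largest weight in every channel lands on ±127 regardless of how the channels differ in magnitude. The activations share a single scale and a nonzero zero-point across the layer.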

For more details, see the full quantization specification.
