https://blog.tensorflow.org/2020/08/tensorflow-model-optimization-toolkit-weight-clustering-api.html?hl=da
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXGkZ8nQkw4HB-6yIjSaJl6jaPh-hqqLfzzG6dcFc7ee78NQppRPdKkTvbF_tI3CANXEpUGv1WlfgVaMdIUbnLGNTq_4WCxUHAd1HlN7fahmlNtkpkF0wIE9wo2XJ4X-O68bf9AKrgKmY/s1600/clustering_presentation.gif
A guest post by Mohamed Nour Abouelseoud and Anton Kachatkou at Arm
We are excited to introduce a weight clustering API, proposed and contributed by Arm, to the TensorFlow Model Optimization Toolkit.
Weight clustering is a technique to reduce the storage and transfer size of your model by replacing many unique parameter values with a smaller number of unique values. This benefit applies to all deployments. With framework and hardware-specific support, such as in the Arm Ethos-N and Ethos-U machine learning processors, weight clustering can additionally improve memory footprint and inference speed.
This work is part of the toolkit's roadmap to support the development of smaller and faster ML models. You can see previous posts on post-training quantization, quantization-aware training, and sparsity for more background on the toolkit and what it can do.
Arm and the TensorFlow team have been collaborating in this space to improve deployment to mobile and IoT devices. 
What is weight clustering?
Increasingly, deep learning applications are moving into more resource-constrained environments, from smartphones to agricultural sensors and medical instruments. This shift has driven efforts toward smaller and more efficient model architectures, as well as an increased emphasis on model optimization techniques such as pruning and quantization.
Weight clustering is an optimization algorithm to reduce the storage and network transfer size of your model. The idea in a nutshell is explained in the diagram below. 

Here’s an explanation of the diagram. Imagine, for example, that a layer in your model contains a 4x4 matrix of weights (represented by the “weight matrix” above). Each weight is stored using a float32 value. When you save the model, you are storing 16 unique float32 values to disk. 
Weight clustering reduces the size of your model by replacing similar weights in a layer with the same value. These values are found by running a clustering algorithm over the model’s trained weights. The user can specify the number of clusters (in this case, 4). This step is shown in “Get centroids” in the diagram above, and the 4 centroid values are shown in the “Centroid” table. Each centroid value has an index (0-3).
Next, each weight in the weight matrix is replaced with its centroid’s index. This step is shown in “Assign indices”. Now, instead of storing the original weight matrix, the weight clustering algorithm can store the modified matrix shown in “Pull indices” (containing the index of the centroid values), and the centroid values themselves. 
In this case, we have reduced the storage from 16 unique floats to 4 floats and 16 2-bit indices, that is, from 512 bits to 160 bits. The savings increase with larger matrix sizes.
Note that even if we still stored 16 floats, they now have just 4 distinct values. Common compression tools (like zip) can now take advantage of the redundancy in the data to achieve higher compression.
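The steps described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the toolkit's implementation: it assumes a simple one-dimensional k-means with evenly spaced starting centroids, and the helper names assign and kmeans_1d are hypothetical.

```python
import random

def assign(values, centroids):
    """Return, for each value, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values]

def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's k-means on scalars (k >= 2); returns (centroids, indices)."""
    lo, hi = min(values), max(values)
    # Assumption: start from centroids spread evenly over the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        indices = assign(values, centroids)
        for c in range(k):
            members = [v for v, i in zip(values, indices) if i == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = sum(members) / len(members)
    return centroids, assign(values, centroids)

random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(16)]  # a flattened 4x4 matrix

centroids, indices = kmeans_1d(weights, k=4)    # "Get centroids" + "Assign indices"
clustered = [centroids[i] for i in indices]     # weights rebuilt from the lookup table

bits_original = len(weights) * 32                          # 16 float32 values
bits_clustered = len(centroids) * 32 + len(indices) * 2    # 4 float32 + 16 2-bit indices
print(bits_original, bits_clustered)  # 512 160
```

The rebuilt weights contain at most 4 distinct values, which is what makes both the index-based storage and generic compression effective.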
The technical implementation of clustering is derived from Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. See the paper for additional details on the gradient update and weight retrieval.
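As a rough illustration of those two ideas as the paper describes them, the sketch below sums the gradients of all weights that share a centroid and applies the result to that centroid, then rebuilds the weights by table lookup. The function names, the plain gradient-descent step, and the sample numbers are assumptions made for the example; the toolkit's internals may differ.

```python
def centroid_gradients(weight_grads, indices, num_clusters):
    """Sum the gradient of every weight into the centroid it points to."""
    grads = [0.0] * num_clusters
    for g, idx in zip(weight_grads, indices):
        grads[idx] += g
    return grads

def retrieve_weights(centroids, indices):
    """Weight retrieval: look each index up in the centroid table."""
    return [centroids[i] for i in indices]

# Hypothetical values, chosen so the arithmetic is easy to follow.
centroids = [-0.5, 0.0, 0.5, 1.0]
indices = [0, 1, 1, 3, 2, 0]
weight_grads = [0.25, -0.5, 0.25, 1.0, 0.5, 0.75]
learning_rate = 0.5

grads = centroid_gradients(weight_grads, indices, len(centroids))
# One plain gradient-descent step on the shared values (an assumption; any
# optimizer could be used here).
centroids = [c - learning_rate * g for c, g in zip(centroids, grads)]
weights = retrieve_weights(centroids, indices)

print(grads)      # [1.0, -0.25, 0.5, 1.0]
print(centroids)  # [-1.0, 0.125, 0.25, 0.5]
```

Because only the centroids are updated, the number of unique weight values stays fixed during fine-tuning.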
Clustering is available through a simple Keras API, in which any Keras model (or layer) can be wrapped and fine-tuned. See usage examples below.
Advantages of weight clustering
Weight clustering has an immediate advantage in reducing model storage and transfer size across serialization formats, as a model with shared parameters has a much higher compression rate than one without. This is similar to a sparse (pruned) model, except that the compression benefit is achieved through reducing the number of unique weights, while pruning achieves it through setting weights below a certain threshold to zero. Once a Keras model is clustered, the benefit of the reduced size is available by passing it through any common compression tool.
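The effect on a generic compressor can be checked with a small experiment. The sketch below is an assumption-laden stand-in, not a measurement of a real model: it draws random weights, snaps them to 16 evenly spaced values instead of running a real clustering pass, and compares zlib output sizes for the float32 byte streams.

```python
import random
import struct
import zlib

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

# Assumption: 16 evenly spaced centroids with nearest-centroid rounding stand
# in for an actual clustering run.
lo, hi = min(weights), max(weights)
centroids = [lo + (hi - lo) * i / 15 for i in range(16)]
clustered = [min(centroids, key=lambda c: abs(w - c)) for w in weights]

def compressed_size(values):
    """Size in bytes of the values packed as float32 and deflated with zlib."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return len(zlib.compress(raw, 9))

size_original = compressed_size(weights)
size_clustered = compressed_size(clustered)
print(size_original, size_clustered)
```

Both arrays occupy the same number of bytes before compression; only the clustered one contains enough repetition for the compressor to exploit.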
To further unlock the improvements in memory usage and speed at inference time associated with clustering, specialized run-time or compiler software and dedicated machine learning hardware are required. Examples include the Arm ML Ethos-N driver stack for the Ethos-N processor and the Ethos-U Vela compiler for the Ethos-U processor. Both examples currently require quantizing and converting optimized Keras models to TensorFlow Lite first.
Clustering can be done on its own or as part of a cascaded Deep Compression optimization pipeline to further reduce model size and improve inference speed.
Compression and accuracy results
Experiments were run on several popular models, demonstrating compression benefits of weight clustering. More aggressive optimizations can be applied, but at the cost of accuracy. Though the table below includes measurements for TensorFlow Lite models, similar benefits are observed for other serialization formats such as SavedModel. 

The table below demonstrates how clustering was configured to achieve the results. Some models were more prone to accuracy degradation from aggressive clustering, in which case selective clustering was used on layers that are more robust to optimization. 
 
Clustering a model
The clustering API is available in the TensorFlow Model Optimization Toolkit starting from release 
v0.4.0. To cluster a model, it needs to be fully trained first before passing it to the clustering API. A snippet of full model clustering is shown below.
import tensorflow_model_optimization as tfmot
cluster_weights = tfmot.clustering.keras.cluster_weights
pretrained_model = pretrained_model()  # placeholder: returns your fully trained Keras model
clustering_params = {
    'number_of_clusters': 32,
    'cluster_centroids_init': tfmot.clustering.keras.CentroidInitialization.LINEAR
}
clustered_model = cluster_weights(pretrained_model, **clustering_params)
# Fine-tune
clustered_model.fit(...)
# Prepare model for serving by removing training-only variables.
model_for_serving = tfmot.clustering.keras.strip_clustering(clustered_model)
...
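For intuition about the cluster_centroids_init option used above, the sketch below shows what a linear initialization amounts to conceptually: the starting centroids are spaced evenly between the smallest and largest weight. The function linear_centroids is a hypothetical helper written for this illustration and is not part of the toolkit's API.

```python
def linear_centroids(weights, number_of_clusters):
    """Evenly spaced starting centroids between min(weights) and max(weights)."""
    if number_of_clusters < 2:
        raise ValueError("this sketch needs at least two clusters")
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (number_of_clusters - 1)
    return [lo + step * i for i in range(number_of_clusters)]

layer_weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical trained weights
initial = linear_centroids(layer_weights, 5)
print(initial)  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

Fine-tuning then moves these starting values toward ones that better fit the trained weights.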
To cluster select layers in a model, you can apply the same clustering method to those layers when constructing a model.
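import tensorflow as tf
from tensorflow.keras.layers import Dense

# Only the middle layer is wrapped, so only its weights are clustered.
clustered_model = tf.keras.Sequential([
    Dense(...),
    cluster_weights(
        Dense(...,
              kernel_initializer=pretrained_weights,
              bias_initializer=pretrained_bias),
        **clustering_params),
    Dense(...)
])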
A selectively clustered layer still needs to have been fully trained, so we use the layer's kernel_initializer and bias_initializer parameters to load the pretrained weights. Using tf.keras.models.clone_model is another option.
Documentation
To learn more about how to use the API, you can try this simple end-to-end clustering example colab to start. A more comprehensive guide with additional tips can be found here.
Acknowledgments
The feature and results presented in this post are the work of many people including the Arm ML Tooling team and our collaborators in Google’s TensorFlow Model Optimization Toolkit team.
From Arm - Anton Kachatkou, Aron Virginas-Tar, Ruomei Yan, Konstantin Sofeikov, Saoirse Stewart, Peng Sun, Elena Zhelezina, Gergely Nagy, Les Bell, Matteo Martincigh, Grant Watson, Diego Russo, Benjamin Klimczak, Thibaut Goetghebuer-Planchon.
From Google - Alan Chiao, Raziel Alvarez