https://blog.tensorflow.org/2020/02/matrix-compression-operator-tensorflow.html

February 15, 2020
*Posted by Rina Panigrahy*

Matrix Compression
Tensors and matrices are the building blocks of machine learning models, in particular deep networks. It is often necessary to have tiny models so that they fit on devices such as phones, home assistants, cars, and thermostats. This not only helps mitigate issues of network availability, latency, and power consumption but is also desirable by end use…

Matrix Compression Operator

The commonly used random projection technique, which uses a random projection matrix to project the input into a lower-dimensional vector and then trains a thinner matrix on that vector, is simply a special case of matrix factorization in which one of the factors is random.
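To make the "special case" concrete, here is a minimal NumPy sketch. The dimensions and the names `R`, `C`, and `A_factored` are illustrative assumptions, not taken from the open-sourced code. It only shows that projecting the input and then applying a thin matrix equals applying a single matrix that is the product of a fixed random factor and a thin trainable factor.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, k = 64, 32, 8  # k < d_in is the projected dimension (illustrative sizes)

# Fixed random factor: projects a d_in-dimensional input down to k dimensions.
R = rng.standard_normal((d_in, k)) / np.sqrt(k)
# Thin factor: in a real model this is the only part that would be trained.
C = rng.standard_normal((k, d_out)) * 0.01

x = rng.standard_normal((5, d_in))  # a batch of 5 inputs

# Project the input, then apply the thin matrix ...
y_two_step = (x @ R) @ C
# ... which equals applying the single factored matrix R @ C.
A_factored = R @ C
y_one_step = x @ A_factored

params_full = d_in * d_out       # parameters in an unfactored d_in x d_out matrix
params_trainable = k * d_out     # parameters that actually need training
```

In this sketch the full matrix would hold 2048 parameters, while the thin trainable factor holds 256; the random factor can be regenerated from its seed rather than stored.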

Thus there is a continuum from pruning to quantization to dictionary learning, and all of them are forms of matrix factorization, just as low-rank approximation is.
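Two points on this continuum can be sketched side by side. The example below is an illustration under assumed sizes, not the post's implementation: it factors the same matrix once by truncated SVD (low-rank approximation) and once by a plain k-means on rows, where the first factor is a one-hot assignment matrix and the second is the table of cluster centers. Dictionary learning sits between the two, allowing a few nonzero coefficients per row instead of exactly one.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 12))
k = 4  # inner dimension shared by both factorizations

# Low-rank approximation: A ~ B @ C with dense B (40 x k) and C (k x 12).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] * s[:k]
C = Vt[:k]
low_rank_err = np.linalg.norm(A - B @ C)

# k-means on rows (a few Lloyd iterations): A ~ Z @ centers,
# where each row of Z is one-hot and selects one cluster center.
centers = A[rng.choice(A.shape[0], size=k, replace=False)].copy()
for _ in range(10):
    dist2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dist2.argmin(axis=1)
    for j in range(k):
        if np.any(assign == j):
            centers[j] = A[assign == j].mean(axis=0)
# Recompute assignments so Z matches the final centers.
dist2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
assign = dist2.argmin(axis=1)
Z = np.eye(k)[assign]
kmeans_err = np.linalg.norm(A - Z @ centers)
```

Because `Z @ centers` has rank at most `k`, its error can never be lower than that of the truncated SVD on the same matrix; what the one-hot factor buys is a much cheaper representation of the first factor.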

We have open sourced a compression operator that accepts any custom matrix factorization method, specified as a MatrixCompressor class. To apply a compression method, one simply calls apply_compression(M, compressor = myCustomFactorization). The operator dynamically replaces a single matrix A with a product of two matrices B*C, obtained by factoring A with the specified custom factorization algorithm. During training, the operator lets the matrix A train for some time, applies the factorization algorithm at a chosen training snapshot, replaces A with the factors B*C, and then continues training the factors. (The product ‘*’ need not be standard matrix multiplication; it can be any custom operation specified in the compression class.) The compression operator can use any of the factorization methods mentioned above.
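The train, factor, and continue schedule described above can be sketched on a toy linear regression. This is not the open-sourced operator or its API: `SvdCompressor`, `loss_and_grad`, `train_with_compression`, and every parameter below are hypothetical names chosen for illustration, and plain gradient descent in NumPy stands in for TensorFlow training.

```python
import numpy as np

class SvdCompressor:
    """Hypothetical compressor: factors A into B @ C by truncated SVD."""

    def __init__(self, rank):
        self.rank = rank

    def factor(self, A):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        root = np.sqrt(s[: self.rank])
        # Split the singular values evenly between the two factors.
        return U[:, : self.rank] * root, root[:, None] * Vt[: self.rank]


def loss_and_grad(X, Y, A):
    """Squared error of X @ A against Y, averaged over rows, and its gradient in A."""
    resid = X @ A - Y
    n = X.shape[0]
    return (resid ** 2).sum() / n, 2.0 * X.T @ resid / n


def train_with_compression(X, Y, compressor, snapshot_step=100, total_steps=300, lr=0.02):
    A = np.zeros((X.shape[1], Y.shape[1]))
    B = C = None
    losses = []
    for step in range(total_steps):
        if step == snapshot_step:
            # Snapshot: replace the single matrix A by its two factors.
            B, C = compressor.factor(A)
        if B is None:
            # Phase 1: train the single matrix A.
            loss, grad_A = loss_and_grad(X, Y, A)
            A = A - lr * grad_A
        else:
            # Phase 2: train the factors; the chain rule gives both gradients.
            loss, grad_A = loss_and_grad(X, Y, B @ C)
            B, C = B - lr * (grad_A @ C.T), C - lr * (B.T @ grad_A)
        losses.append(loss)
    return B, C, losses


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
A_true = (rng.standard_normal((20, 4)) / np.sqrt(20)) @ (rng.standard_normal((4, 10)) / np.sqrt(10))
Y = X @ A_true + 0.1 * rng.standard_normal((200, 10))
B, C, losses = train_with_compression(X, Y, SvdCompressor(rank=4))
```

Any object with a `factor` method could be passed in place of `SvdCompressor` here, which mirrors the idea of a pluggable factorization; the real operator's class interface may differ.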

For dictionary learning, we even have an implementation based on orthogonal matching pursuit (OMP) that is faster than the scikit-learn implementation. We also have an improved gradient-based pruning method that decides which entries to prune using not only their magnitude but also their effect on the final loss, measured by the gradient of the loss with respect to each entry.
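The difference between magnitude-only and gradient-aware pruning shows up even on a 2 x 2 example. The post does not give the exact scoring formula, so the sketch below assumes one common choice: the score of an entry is |weight x gradient|, a first-order estimate of how much the loss changes when that entry is set to zero. The helper `prune_mask` and the sample values are hypothetical.

```python
import numpy as np

def prune_mask(scores, sparsity):
    """Return a boolean mask that drops the lowest-scoring fraction of entries."""
    n_prune = int(sparsity * scores.size)
    mask = np.ones(scores.size, dtype=bool)
    if n_prune > 0:
        lowest = np.argsort(scores, axis=None)[:n_prune]
        mask[lowest] = False
    return mask.reshape(scores.shape)

# Toy weights and the gradient of the loss with respect to each weight.
W = np.array([[0.1, -2.0],
              [1.5, 0.05]])
G = np.array([[5.0, 0.01],
              [0.1, 0.2]])

# Magnitude pruning: score each entry by |w| alone.
magnitude_mask = prune_mask(np.abs(W), sparsity=0.5)

# Gradient-aware pruning (assumed score): |w * dL/dw|.
gradient_mask = prune_mask(np.abs(W * G), sparsity=0.5)

W_magnitude_pruned = W * magnitude_mask
W_gradient_pruned = W * gradient_mask
```

Here magnitude pruning drops the small entry 0.1 even though its gradient is large, while the gradient-aware score keeps it and instead drops -2.0, whose gradient is nearly zero.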

Thus we are performing a mutation of the network in real time: at the beginning the layer contains only one matrix, and at the end it contains two. Our factorization methods need not even be gradient based; they may involve more discrete algorithms such as hashing, OMP for dictionary learning, or clustering for k-means. Our operator therefore demonstrates that it is possible to mix continuous gradient-based methods with more traditional discrete algorithms. The experiments also included simhash, a method based on random projections: the matrix to be compressed is multiplied by a random projection matrix and the entries are rounded to binary values, so the result is a factorization into one random projection matrix and one binary matrix. The following plots show how these algorithms perform on compressing models for CIFAR10 and PTB. The results show that while low-rank approximation beats simhash and k-means on CIFAR10, on PTB dictionary learning is slightly better than low-rank approximation.
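The simhash factorization described above can be illustrated in a few lines of NumPy. The post does not specify how the original matrix is recovered from the two factors, so the reconstruction step here is an assumption: for Gaussian projections, the averaged product of the sign matrix and the transposed projection matrix points along each original row, and a stored per-row norm restores the scale. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 16, 32, 512  # k binary values stored per row (illustrative sizes)

A = rng.standard_normal((m, n))   # matrix to be compressed
R = rng.standard_normal((n, k))   # random projection matrix (reproducible from the seed)

# Binary factor: project, then round each entry to -1 or +1.
S = np.sign(A @ R)
S[S == 0] = 1.0

# Assumed reconstruction: for Gaussian r, E[sign(a . r) r] = sqrt(2 / pi) * a / ||a||,
# so averaging S @ R.T recovers each row's direction; the row norm restores its length.
row_norms = np.linalg.norm(A, axis=1, keepdims=True)
A_hat = np.sqrt(np.pi / 2) * row_norms * (S @ R.T) / k

# Cosine similarity between each original row and its reconstruction.
cosine = (A * A_hat).sum(axis=1) / (
    np.linalg.norm(A, axis=1) * np.linalg.norm(A_hat, axis=1)
)
```

The binary factor costs one bit per entry, so the quality of the approximation is governed by how many sign bits `k` are kept per row; fewer bits give a coarser estimate of each row's direction.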
