https://blog.tensorflow.org/2018/09/achieving-power-efficient-on-device-image-recognition-qualcomm.html?hl=uk
Achieving Power-efficient On-device Image Recognition — Qualcomm Technologies’ Approach
A guest article by Chen Feng, Terry Sheng, Jay Zhuo, Zhiyu Liang, Parker Zhang and Liang Shen of Qualcomm Technologies
IEEE LPIRC challenge
The Low-Power Image Recognition Challenge (LPIRC) is an annual competition that evaluates computer vision technologies by accuracy, execution time, and energy consumption. The goal of the track 1 challenge, one of the three tracks in the competition that are sponsored by Google and Facebook this year, is to maximize image classification accuracy while processing 20,000 images within a time limit of 10 minutes on a Pixel 2 smartphone, which is powered by the Qualcomm Snapdragon 835 Mobile Platform. The competition uses a large dataset of about 1.2M JPEG images across 1,000 different categories as the training data, and a holdout image set as the testing data.
The public competition was driven by the real-world need for accurate image classification neural network models that run in real time on mobile devices. On top of accuracy, computational efficiency is critical on battery-powered devices. In the competition, our team took first place for speed and accuracy by using a quantization-friendly MobileNet V2 architecture together with an advanced post-quantization scheme. We modified the graph in TensorFlow by inserting FakeQuantization nodes with the calculated min and max values of each layer, and used TensorFlow Lite to convert the graph to .tflite for hardware deployment.
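To illustrate what a fake-quantization node does numerically, here is a minimal pure-Python sketch. It is not TensorFlow's implementation: real fake-quant ops typically also adjust the range so that zero is exactly representable, which this sketch skips. The function name `fake_quantize` and the sample values are hypothetical.

```python
def fake_quantize(values, min_val, max_val, num_bits=8):
    """Simulate quantization in floating point: clamp, round to a grid, map back."""
    levels = 2 ** num_bits - 1                 # 255 intervals for 8 bits
    scale = (max_val - min_val) / levels       # width of one quantization step
    out = []
    for v in values:
        clamped = min(max(v, min_val), max_val)    # values outside the range saturate
        q = round((clamped - min_val) / scale)     # integer code in [0, levels]
        out.append(min_val + q * scale)            # back to float ("dequantize")
    return out


# Hypothetical layer output and a hypothetical calibrated range of [-1, 1].
activations = [-1.0, -0.26, 0.0, 0.5, 3.2]
simulated = fake_quantize(activations, min_val=-1.0, max_val=1.0)
print(simulated)
```

In-range values move by at most half a step, while the out-of-range value 3.2 saturates at the max of the range. This is why the choice of min and max for each layer matters.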
Photo: Qualcomm Canada Inc.’s team, Parker Zhang, Liang Shen, Chen Feng, Terry Sheng, Jay Zhuo, and Zhiyu Liang.
Our model achieves the highest accuracy in recognizing the 20,000 images, at 28 milliseconds per inference on a single ARM CPU.
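As a quick sanity check, the challenge numbers can be turned into a per-image time budget. This sketch assumes the 28 ms figure covers the whole per-image cost, which the article does not state; the variable names are illustrative.

```python
TIME_LIMIT_S = 10 * 60      # 10-minute limit from the track 1 rules
NUM_IMAGES = 20_000         # images to classify
LATENCY_MS = 28             # reported latency per inference

budget_ms_per_image = TIME_LIMIT_S * 1000 / NUM_IMAGES   # average time allowed per image
total_inference_s = LATENCY_MS * NUM_IMAGES / 1000       # time spent on inference alone
headroom_s = TIME_LIMIT_S - total_inference_s            # what is left of the limit

print(budget_ms_per_image, total_inference_s, headroom_s)
```

Under that assumption the budget is 30 ms per image, so 28 ms per inference takes 560 seconds of the 600 available.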
“This challenge aligned perfectly with our AI strategy,” says Mickey Aleksic, Vice President, Engineering, Qualcomm Technologies, Inc. “Winning this challenge goes a long way in recognizing Qualcomm Technologies’ key role in machine learning and making on-device AI ubiquitous.” To learn more about our AI research, click here.
Achieving lightning-fast, on-device image recognition
Accurate and fast image recognition on an edge device requires the following steps.
- Create and train a neural network model, in floating point, to identify and classify images.
- Convert the floating-point model to a fixed-point model that can run efficiently on edge devices without latency or accuracy issues.
Our team’s model is based on MobileNet v2, but is modified to be “quantization-friendly”. Although Google’s MobileNet models successfully reduce parameter size and computation latency through the use of separable convolution, directly quantizing a pre-trained MobileNet v2 model can cause a loss in accuracy. The team analyzed and identified the root cause of the accuracy loss due to quantization in such separable convolution networks, and solved it without using quantization-aware re-training. Quantization-aware training can also obtain good accuracy; our approach is an alternative that modifies the network architecture to solve the quantization problem without retraining. Another, more end-to-end approach is to use Learn2Compress, Google’s ML framework for directly training efficient on-device models from scratch or from an existing TensorFlow model by optimizing over multiple network architectures and combining quantization with other techniques like distillation, pruning, and joint training.
Model architecture

In separable convolutions, depthwise convolution is applied to each channel independently. However, the min and max values used for weight quantization are taken collectively from all channels. An outlier in one channel may cause a quantization loss for the whole model due to an enlarged data range. Because no computation crosses channels, depthwise convolution is prone to producing all-zero weights in a channel; this is commonly observed in both MobileNet v1 and v2 models. All-zero values in a channel mean a small variance, so a large “scale” value is expected for that channel when the batch normalization transform is applied directly after the depthwise convolution. This large scale acts as such an outlier and hurts the representation power of the whole model.
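The effect of a single outlier on a shared quantization range can be seen with a small pure-Python experiment. The weights below are made up for illustration. The per-channel range appears only as a point of comparison; the article's fix is architectural, not per-channel quantization.

```python
def quantize_dequantize(values, min_val, max_val, num_bits=8):
    """Round each value to the nearest point of a uniform grid on [min_val, max_val]."""
    scale = (max_val - min_val) / (2 ** num_bits - 1)
    return [min_val + round((v - min_val) / scale) * scale for v in values]


def mean_abs_error(values, min_val, max_val):
    approx = quantize_dequantize(values, min_val, max_val)
    return sum(abs(a - b) for a, b in zip(values, approx)) / len(values)


# Hypothetical depthwise weights: two well-behaved channels, one with an outlier.
channels = {
    "ch0": [-0.30, -0.12, 0.05, 0.21, 0.33],
    "ch1": [-0.25, -0.08, 0.02, 0.17, 0.29],
    "ch2": [-0.10, 0.04, 0.11, 0.20, 24.0],   # 24.0 is the outlier
}

# Min and max taken collectively from all channels, as described above.
all_weights = [w for ch in channels.values() for w in ch]
shared_min, shared_max = min(all_weights), max(all_weights)

ch0 = channels["ch0"]
err_shared = mean_abs_error(ch0, shared_min, shared_max)   # range stretched by ch2
err_own = mean_abs_error(ch0, min(ch0), max(ch0))          # range of ch0 alone

print(f"ch0 error with shared range: {err_shared:.5f}")
print(f"ch0 error with its own range: {err_own:.5f}")
```

Because the outlier in `ch2` stretches the shared range, the well-behaved channel `ch0` is left with only a handful of usable quantization levels, and its error grows accordingly.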
As a solution, our team proposed an effective quantization-friendly separable convolution architecture, where the non-linearity operations (both batch normalization and ReLU6) between depthwise and pointwise convolution layers are all removed, letting the network learn proper weights to handle the batch normalization transform directly. In addition, ReLU6 is replaced with ReLU in all pointwise convolution layers. From various experiments in MobileNet v1 and v2 models, this architecture shows a significant accuracy boost in the 8-bit quantized pipeline.
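The change in layer ordering can be summarized with a small sketch. It models one MobileNet v1-style separable block as a list of operation names. Keeping batch normalization after the pointwise convolution is an assumption made here; the paragraph above only states what is removed and replaced. The function name is hypothetical.

```python
def separable_block(quantization_friendly):
    """Return the operation sequence of one depthwise-separable block."""
    if quantization_friendly:
        # No batch norm or ReLU6 between depthwise and pointwise; ReLU replaces ReLU6.
        return ["depthwise_conv", "pointwise_conv", "batch_norm", "relu"]
    # Baseline v1-style ordering: batch norm and ReLU6 after each convolution.
    return ["depthwise_conv", "batch_norm", "relu6",
            "pointwise_conv", "batch_norm", "relu6"]


baseline = separable_block(quantization_friendly=False)
modified = separable_block(quantization_friendly=True)
print(" -> ".join(baseline))
print(" -> ".join(modified))
```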
Post-quantization techniques
Once the model structure is defined, a floating-point model can be trained on the dataset. During the post-quantization step, the model is run against a range of different inputs, one image from each class category in the training data, to collect the min and max values as well as the data histogram distribution at each layer output. A greedy search then picks the values of the “step size” (∆) and “offset” for linear quantization that minimize the sum of quantization loss and saturation loss. Given the calculated range of min and max values, TensorFlow Lite provides a path to convert a graph model to a .tflite model that can be deployed on edge devices.
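The article does not spell out the search procedure or the exact loss definitions, so the following is only a rough sketch of the idea. It assumes squared error for both losses, a fixed 5% shrink per step, and a 4-bit grid so that the effect is visible on a tiny example. All names and sample data are hypothetical.

```python
def total_loss(values, lo, hi, num_bits=8):
    """Quantization loss (rounding, in range) plus saturation loss (clipping, out of range)."""
    step = (hi - lo) / (2 ** num_bits - 1)
    quant_loss = sat_loss = 0.0
    for v in values:
        if v < lo or v > hi:
            clipped = lo if v < lo else hi
            sat_loss += (v - clipped) ** 2
        else:
            approx = lo + round((v - lo) / step) * step
            quant_loss += (v - approx) ** 2
    return quant_loss + sat_loss


def greedy_range_search(values, num_bits=8, shrink=0.05, max_iters=200):
    """Greedily shrink [lo, hi] from the observed min/max while the total loss drops."""
    lo, hi = min(values), max(values)
    best = total_loss(values, lo, hi, num_bits)
    for _ in range(max_iters):
        width = hi - lo
        # Two candidate moves: raise the lower bound or lower the upper bound.
        candidates = [(lo + shrink * width, hi), (lo, hi - shrink * width)]
        losses = [total_loss(values, c_lo, c_hi, num_bits) for c_lo, c_hi in candidates]
        i = losses.index(min(losses))
        if losses[i] >= best:
            break                      # neither move helps: stop
        (lo, hi), best = candidates[i], losses[i]
    step = (hi - lo) / (2 ** num_bits - 1)
    return step, lo, best              # lo plays the role of the "offset" here


# Hypothetical layer output: evenly spread values in [-1, 1] plus one outlier.
samples = [i / 1000 for i in range(-1000, 1001)] + [3.0]
baseline_loss = total_loss(samples, min(samples), max(samples), num_bits=4)
step, offset, loss = greedy_range_search(samples, num_bits=4)
print(f"min/max range loss: {baseline_loss:.3f}, after greedy search: {loss:.3f}")
```

Clipping the lone outlier costs a little saturation loss but gives every other value a finer grid, so the combined loss goes down compared with using the raw min and max.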
Photo: Qualcomm Technologies, Inc.’s Ning Bi (right center) accepted the award on the team’s behalf.
Conclusions
Moving calculations to 8-bit while preserving a high level of accuracy is a key step for models to run fast and efficiently on edge devices. Our team spotted the quantization issue, analyzed and identified the root cause, and solved it. We then applied our findings to the image classification challenge and got to see our theoretical work come to life. You can learn more in our published paper, titled “A quantization-friendly separable convolution architecture for MobileNets” (https://arxiv.org/abs/1803.08607).
Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.