Accelerating TensorFlow Lite on Qualcomm Hexagon DSPs

December 16, 2019

Posted by Karim Nosseir and Sachin Joglekar, Software Engineers

Digital Signal Processors (DSPs), such as Hexagon DSPs, are microprocessors present on most modern phones alongside other compute units such as the CPU and GPU. Developed with the primary goal of improving communication and multimedia processing, these chips can dramatically speed up model inference on mobile / edge devices. DSPs are often also more power efficient than GPUs and are critical to unlocking use cases which depend on low power usage.

Today, we are excited to announce a new TensorFlow Lite delegate that uses Hexagon NN Direct to run quantized models faster on the millions of mobile devices with Hexagon DSPs. With it, we see performance gains in the range of 3-25x (see details below) for models like MobileNet and Inception V3!

Figure 1: High-level overview of how a delegate works at runtime. Supported portions of the graph run on the accelerator, while other operations run on the CPU via TensorFlow Lite kernels

This delegate complements NNAPI acceleration and is for devices that either don’t yet support NNAPI or do not have an NNAPI driver for the DSP.

The delegate supports most Qualcomm® Snapdragon™ SoCs including:

Snapdragon 835 (682 DSP)
Snapdragon 660/820/821 (680 DSP)
Snapdragon 710/845 (685 DSP)
Snapdragon 855 (690 DSP)

Which models are supported?

With this initial launch, we aim to speed up inference for fixed-point (uint8) models trained with quantization-aware training. Examples of such models include the quantized variants on our Hosted Models page. Models built with post-training quantization require special handling of their tensors; we are currently exploring this with Qualcomm Technologies, and it is not included in this launch.
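For context, a uint8 fixed-point model in TensorFlow Lite stores each real value as an 8-bit code under an affine mapping, real = scale * (q - zeroPoint). The following is a minimal sketch of that mapping. The class name, method names, and the example scale and zero point are illustrative assumptions and are not part of the delegate's API.

```java
// Illustrative sketch of uint8 affine quantization: real = scale * (q - zeroPoint).
// The class and the example parameters are hypothetical, not part of TensorFlow Lite.
final class Uint8QuantSketch {
  // Maps a real value to the nearest uint8 code, clamped to [0, 255].
  static int quantize(float real, float scale, int zeroPoint) {
    int q = Math.round(real / scale) + zeroPoint;
    return Math.max(0, Math.min(255, q));
  }

  // Maps a uint8 code back to the real value it represents.
  static float dequantize(int q, float scale, int zeroPoint) {
    return scale * (q - zeroPoint);
  }

  public static void main(String[] args) {
    float scale = 0.5f;   // assumed example scale
    int zeroPoint = 128;  // assumed example zero point
    int q = quantize(3.0f, scale, zeroPoint);
    System.out.println(q + " -> " + dequantize(q, scale, zeroPoint));
  }
}
```

Values outside the representable range saturate at 0 or 255, which is one reason the scale and zero point are chosen per tensor during training.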

Specifically, the delegate is able to run many compute-heavy ops, such as convolutions, pooling and fully connected layers, albeit with certain constraints that are checked before delegation at runtime. Ops that do not meet the criteria are run on the CPU by TensorFlow Lite. The complete list of ops and corresponding restrictions (if any) is given in the delegate’s documentation.
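The split between delegated and CPU ops can be pictured with a simplified sketch. It assumes a linear sequence of ops and a plain set of supported op names. The real runtime works on a graph and also checks each op's parameters, so this is an illustration of the idea and not the delegate's actual algorithm. The class name, the supported set, and the `CUSTOM_POSTPROCESS` op are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: groups a linear op sequence into contiguous runs that
// go either to the accelerator or to the CPU.
final class DelegatePartitionSketch {
  // One contiguous run of ops assigned to a single target.
  static final class Partition {
    final boolean delegated;
    final List<String> ops = new ArrayList<>();

    Partition(boolean delegated) {
      this.delegated = delegated;
    }

    @Override
    public String toString() {
      return (delegated ? "DSP" : "CPU") + ops;
    }
  }

  // Starts a new partition whenever the target changes between neighboring ops.
  static List<Partition> partition(List<String> ops, Set<String> supported) {
    List<Partition> result = new ArrayList<>();
    Partition current = null;
    for (String op : ops) {
      boolean ok = supported.contains(op);
      if (current == null || current.delegated != ok) {
        current = new Partition(ok);
        result.add(current);
      }
      current.ops.add(op);
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumed supported set, for illustration only.
    Set<String> supported =
        new HashSet<>(Arrays.asList("CONV_2D", "AVERAGE_POOL_2D", "FULLY_CONNECTED"));
    List<String> ops =
        Arrays.asList("CONV_2D", "AVERAGE_POOL_2D", "CUSTOM_POSTPROCESS", "FULLY_CONNECTED");
    System.out.println(partition(ops, supported));
  }
}
```

Each switch between targets adds a hand-off between the DSP and the CPU, which is consistent with the observation later in this post that a model running mostly on the DSP benefits more than one running only partly on it.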

Performance gains

To showcase performance gains with the new delegate, we did a test-drive with two common models on three devices with different Qualcomm® Snapdragon™ SoCs. We also benchmarked inference on the floating-point versions of these models with our GPU delegate for comparison.

Models

The quantized models have been trained with quantization-aware training.

SSD MobileNet V1 [download: quantized, floating-point] : Object Detection
- Small model, runs partly on Hexagon DSP. Post-processing runs on CPU.
Inception V3 [download: quantized, floating-point] : Image Classification
- Large model, runs mostly on the Hexagon DSP.

Devices (along with Qualcomm® Snapdragon™ SoC)

Oppo K1 (Snapdragon 660, Android 8.1)
Vivo Z3 (Snapdragon 710, Android 8.1)
Samsung S9 (Snapdragon 845, Android 9)

Figure 2: Latencies and speed-up (on Hexagon) observed for SSD MobileNet V1. GPU version uses floating-point model, while CPU/Hexagon run quantized version. CPU Baseline denotes single-threaded TensorFlow Lite kernels.

Figure 3: Latencies and speed-up (on Hexagon) observed for Inception-V3. GPU version uses floating-point model, while CPU/Hexagon run quantized version. CPU Baseline denotes single-threaded TensorFlow Lite kernels.

How do I use it?

This section gives a brief summary of the Hexagon delegate’s usage. Note that there is no device-specific coding involved - you just have to make a call on the TensorFlow Lite `Interpreter` with an instance of the new delegate. For a detailed explanation, please look at the full documentation.

Step 1: Add the TensorFlow Lite Hexagon AAR
Add the ‘tensorflow-lite-hexagon.aar’ to your app - this is in addition to the standard tensorflow-lite AAR (nightly or release).

Step 2: Obtain Hexagon libraries
Run “hexagon_nn_skel*.run” and accept the license agreement. It should provide three different shared libraries, all of which need to be included in your app. The delegate will automatically pick the one with the best performance depending on the device.

Step 3: Use the Delegate API at runtime
Use either the Java API (example below) or the C API (shown in the documentation) to invoke the TensorFlow Lite delegate during inference.

Java Example

// Create the delegate instance. It is declared outside the try block so that
// it is still in scope when it is closed later.
Interpreter.Options tfliteOptions = new Interpreter.Options();
HexagonDelegate hexagonDelegate = null;
try {
  hexagonDelegate = new HexagonDelegate(activity);
  tfliteOptions.addDelegate(hexagonDelegate);
} catch (UnsupportedOperationException e) {
  // Hexagon delegate is not supported on this device; inference runs on the CPU.
}

tfliteInterpreter = new Interpreter(tfliteModel, tfliteOptions);

// Run inference.
while (true) {
  writeToInputTensor(inputTensor);
  tfliteInterpreter.run(inputTensor, outputTensor);
  readFromOutputTensor(outputTensor);
}

// Release the interpreter, then the delegate, once inference is finished.
tfliteInterpreter.close();
if (hexagonDelegate != null) {
  hexagonDelegate.close();
}

Binary size

The Hexagon delegate code is expected to add approximately 380KB for arm64-v8a or 202KB for armeabi-v7a to your Android APK. The Hexagon DSP libraries require an additional 3MB. If you know the exact architecture (v65/v66/v60) that you will be targeting, that comes down to 1MB. We are working with Qualcomm® Technologies on ways to reduce this size further.
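As a rough worked example of the totals these figures imply, the sketch below adds the delegate size to the DSP library size for each combination. The class and method names are hypothetical, and treating 1MB as 1024KB is an assumption made for illustration; all values are approximate.

```java
// Sketch of the size arithmetic using the approximate figures quoted in this post.
// Treating 1MB as 1024KB is an assumption made for illustration.
final class HexagonSizeSketch {
  static final int DELEGATE_ARM64_KB = 380;      // delegate code, arm64-v8a
  static final int DELEGATE_ARMV7_KB = 202;      // delegate code, armeabi-v7a
  static final int ALL_DSP_LIBS_KB = 3 * 1024;   // all DSP library variants
  static final int ONE_DSP_LIB_KB = 1 * 1024;    // single known DSP architecture

  // Returns the approximate APK size increase in KB for one ABI.
  static int addedKb(boolean arm64, boolean singleDspArch) {
    int delegate = arm64 ? DELEGATE_ARM64_KB : DELEGATE_ARMV7_KB;
    int libs = singleDspArch ? ONE_DSP_LIB_KB : ALL_DSP_LIBS_KB;
    return delegate + libs;
  }

  public static void main(String[] args) {
    System.out.println("arm64-v8a, all DSP libs:   " + addedKb(true, false) + " KB");
    System.out.println("arm64-v8a, one DSP lib:    " + addedKb(true, true) + " KB");
    System.out.println("armeabi-v7a, all DSP libs: " + addedKb(false, false) + " KB");
    System.out.println("armeabi-v7a, one DSP lib:  " + addedKb(false, true) + " KB");
  }
}
```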

Future work

Over the coming months, we will improve upon the existing delegate with more op coverage and additional optimizations.

Adding support for our 8-bit symmetric quantization spec is also on the roadmap. This will allow acceleration of models quantized with the new post-training quantization tooling.

Feedback

This was a common feature request we got from our developers. We are happy to release it and are looking forward to hearing your thoughts. Share your use case directly or on Twitter as #TFLite and #PoweredByTF. For bugs or issues, please reach out to us on GitHub.

Acknowledgements

Karim Nosseir, Sachin Joglekar, Denise Ang, Jared Duke. We want to thank our collaborators at Qualcomm® Technologies for their contributions and help in making this possible.

Note: Qualcomm® Hexagon and Qualcomm® Snapdragon™ are products of Qualcomm® Technologies, Inc. and / or its subsidiaries.
