Accelerating TensorFlow Lite Micro on Cadence Audio Digital Signal Processors
March 24, 2022

Posted by Raj Pawate (Cadence) and Advait Jain (Google)

Digital Signal Processors (DSPs) are a key part of any battery-powered device offering a way to process audio data with a very low power consumption. These chips run signal processing algorithms such as audio codecs, noise canceling and beam forming.

Increasingly these DSPs are also being used to run neural networks such as wake-word detection, speech recognition, and noise suppression. A key part of enabling such applications is the ability to execute these neural networks as efficiently as possible.

However, productization paths for machine learning on DSPs can often be ad-hoc. In contrast, speech, audio, and video codecs have worldwide standards bodies such as ITU and 3GPP creating algorithms for compression and decompression addressing several aspects of quality measurement, fixed point arithmetic considerations and interoperability.

TensorFlow Lite Micro (TFLM) is a generic open-sourced inference framework that runs machine learning models on embedded targets, including DSPs. Similarly, Cadence has invested heavily in PPA-optimized hardware-software platforms such as Cadence Tensilica HiFi DSP family for audio and Cadence Tensilica Vision DSP family for vision.

Google and Cadence – A Multi-Year Partnership for Enabling AI at the Edge

This was the genesis of the collaboration between the TFLM team and the Audio DSP teams at Cadence, starting in 2019. The TFLM team is focusing on leveraging the broad TensorFlow framework and developing a smooth path from training to embedded and DSP deployment via an interpreter and reference kernels. Cadence is developing a highly optimized software library, called NeuralNet library (NNLIB), that leverages the SIMD and VLIW capabilities of their low-power HiFi DSPs. This collaboration started with three optimized kernels for one Xtensa DSP, and now encompasses over 50 kernels across a variety of platforms such as HiFi 5, HiFi 4, HiFi 3z, Fusion F1 as well as Vision DSPs such as P6, and includes the ability to offload to an accelerator, if available.

Additionally, we have collaborated to add continuous integration for all the optimized code targeted for the Cadence DSPs. This includes infrastructure that tests that every pull request to the TFLM repository passes all the unit tests for the Tensilica toolchain with various HiFix and Vision P6 cores. As such, we ensure that the combined TFLM and NNLIB open source software is both tightly integrated and has good automated test coverage.

Performance Improvements

Most recently, we have collaborated on adding optimizations for models that are quantized with int16 activations. Specifically in the domain of audio, int16 activations can be critical for the quality of quantized generative models. We expect that these optimized kernels will enable a new class of ML-powered audio signal processing. The table below shows a few operators that are required for implementing a noise suppression neural net. We show a 267x improvement in cycle count for a variant of SEANet, an example noise suppression neural net.

The following table shows the improvement with the optimized kernels relative to the reference implementations as measured with the Xtensa instruction set simulation tool.

Operator

Improvement

Transpose Conv

458x

Conv2D

287x

Sub

39x

Add

24x

Leaky ReLU

18x

Srided_Slice

10x

Pad

6x

Overall Network

267x

How to use these optimizations

All of the code can be used from the TFLite Micro GitHub repository.

To use HiFi 3z targeted TFLM optimizations, the following conditions need to be met:

  • the TensorFlow Lite (TFLite) flatbuffer model is quantized with int16 activations and int8 weights
  • it uses one or more of the operators listed in the table above
  • TFLM is compiled with OPTIMIZED_KERNEL_DIR=xtensa

For example, you can run Conv2D kernel integration tests with reference C++ code with:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa TARGET_ARCH=hifi4 XTENSA_CORE= test_integration_tests_seanet_conv

And compare that to the optimized kernels by adding OPTIMIZED_KERNEL_DIR=xtensa:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa TARGET_ARCH=hifi4 OPTIMIZED_KERNEL_DIR=xtensa XTENSA_CORE= test_integration_tests_seanet_conv

Looking Ahead

While the work thus far has been primarily focused on convolutional neural networks, Google and Cadence are also working together to develop an optimized LSTM operator and have released a first example of an LSTM-based key-word recognizer. We expect to expand on this and continue to bring optimized and production-ready implementations of the latest developments in AI/ML to Tensilica Xtensa DSPs.

Acknowledgements

We would like to acknowledge a number of our colleagues who have contributed to making this collaboration successful.

Cadence: Int16 optimizations: Manjunath CP, Bhanu Prakash Venkata, Anirban Mandal LSTM implementation: Niranjan Yadla, Lukman Rahumathulla, Manjunath CP, Pramodkumar Surana, Arjun Medinakere NNLIB optimizations: Vijay Pawar, Prasad Nikam, Harshavardhan, Mayur Jagtap, Raj Pawate

Google: Advait Jain, Deqiang Chen, Lawrence Chan, Marco Tagliasacchi, Nat Jeffries, Nick Kreeger, Pete Warden, Rocky Rhodes, Ting Yan, Yunpeng Li, Victor Ungureanu

Next post
Accelerating TensorFlow Lite Micro on Cadence Audio Digital Signal Processors

Posted by Raj Pawate (Cadence) and Advait Jain (Google) Digital Signal Processors (DSPs) are a key part of any battery-powered device offering a way to process audio data with a very low power consumption. These chips run signal processing algorithms such as audio codecs, noise canceling and beam forming. Increasingly these DSPs are also being used to run neural networks such as wake-word detect…