TensorFlow Lite Core ML delegate enables faster inference on iPhones and iPads

4월 02, 2020

Posted by Tei Jeong and Karim Nosseir, Software Engineers

TensorFlow Lite offers options to delegate part of the model inference, or the entire model inference, to accelerators, such as the GPU, DSP, and/or NPU for efficient mobile inference. On Android, you can choose from several delegates: NNAPI, GPU, and the recently added Hexagon delegate. Previously, with Apple's mobile devices — iPhones and iPads — the only option was the GPU delegate.

When Apple released its machine learning framework Core ML and Neural Engine (a neural processing unit (NPU) in Apple's Bionic SoC) this allowed TensorFlow Lite to leverage Apple's hardware.

Neural processing units (NPUs), similar to Google’s Edge TPU and Apple's Neural Engine, are specialized hardware accelerators designed to accelerate machine learning applications. These chips are designed to speed up model inference on mobile or edge devices and use less power than running inference on the CPU or GPU.

Today, we are excited to announce a new TensorFlow Lite delegate that uses Apple's Core ML API to run floating-point models faster on iPhones and iPads with the Neural Engine. We are able to see performance gains up to 14x (see details below) for models like MobileNet and Inception V3.

Figure 1: High-level overview of how a delegate works at runtime. Supported portions of the graph run on the accelerator, while other operations run on the CPU via TensorFlow Lite kernels.

Which devices are supported?

This delegate runs on iOS devices with iOS version 12 or later (including iPadOS). However, to get true performance benefits, it should run on devices with Apple A12 SoC or later (for example, iPhone XS). For older iPhones, you should use the TensorFlow lite GPU delegate to get faster performance.

Which models are supported?

With this initial launch, 32-bit floating point models are supported. A few examples of supported models include, but not limited to, image classification, object detection, object segmentation, and pose estimation models. The delegate supports many compute-heavy ops such as convolutions, though there are certain constraints for some ops. These constraints are checked before delegation in runtime so that unsupported ops automatically fall back to the CPU. The complete list of ops and corresponding restrictions (if any) are in the delegate’s documentation.

Impacts on performance

We tested the delegate with two common float models, MobileNet V2 and Inception V3. Benchmarks were conducted on the iPhone 8+ (A11 SoC), iPhone XS (A12 SoC) and iPhone 11 Pro (A13 SoC), and tested for three delegate options: CPU only (no delegate), GPU, and Core ML delegate. As mentioned before, you can see the accelerated performance on models with A12 SoC and later, but on iPhone 8+ — where Neural Engine is not available for third parties — there is no observed performance gain when using the Core ML delegate with small models. For larger models, performance is similar to GPU delegate.

In addition to model inference latency, we also measured startup latency. Note that accelerated speed comes at a tradeoff with delayed startup. For the Core ML delegate, startup latency increases along with the model size. For example, on smaller models like MobileNet, we observed a startup latency of 200-400ms. On the other hand, for larger models, like Inception V3, the startup latency could be 2-4 seconds. We are working on reducing the startup latency. The delegate also has an impact on the binary size. Using the Core ML delegate may increase the binary size by up to 1MB.

Models

MobileNet V2 (1.0, 224, float) [download] : Image Classification

Small model. Entire graph runs on Core ML.

Inception V3 [download] : Image Classification

Large model. Entire graph runs on Core ML.

Devices

iPhone 8+ (Apple A11, iOS 13.2.3)
iPhone XS (Apple A12, iOS 13.2.3)
iPhone 11 Pro (Apple A13, iOS 13.2.3)

Latencies and speed-ups observed for MobileNet V2

Figure 2: Latencies and speed-ups observed for MobileNet V2. All versions use a floating-point model. CPU Baseline denotes two-threaded TensorFlow Lite kernels.
* GPU: Core ML uses CPU and GPU for inference. NPU: Core ML uses CPU and GPU, and NPU(Neural Engine) for inference.

Latencies and speed-ups observed for Inception V3

Figure 3: Latencies and speed-ups observed for Inception V3. All versions use a floating-point model. CPU Baseline denotes two-threaded TensorFlow Lite kernels.
* GPU: Core ML uses CPU and GPU for inference. NPU: Core ML uses CPU and GPU, and NPU(Neural Engine) for inference.

How do I use it?

You simply have to make a call on the TensorFlow Lite Interpreter with an instance of the new delegate. For a detailed explanation, please read the full documentation. You can use either the Swift API (example below) or the C++ API (shown in the documentation) to invoke the TensorFlow Lite delegate during inference.

Swift example

This is how invoking this delegate would look from a typical Swift application. All you have to do is create a new Core ML delegate and pass it to the original interpreter initialization code.

let coreMLDelegate = CoreMLDelegate()
let interpreter = try Interpreter(modelPath: modelPath,
                                  delegates: [coreMLDelegate])

Future work

Over the coming months, we will improve upon the existing delegate with more op coverage and additional optimizations. Support for models trained with post-training float16 quantization is on the roadmap. This will allow acceleration of models with about half the model size and small accuracy loss.

Support for post-training weight quantization (also called dynamic quantization) is also on the roadmap.

Feedback

This was a common feature request that we got from our developers. We are excited to release this and are looking forward to hearing your thoughts. Share your use case directly or on Twitter with hashtags #TFLite and #PoweredByTF. For bugs or issues, please reach out to us on GitHub.

Acknowledgements

Thank you to Tei Jeong, Karim Nosseir, Sachin Joglekar, Jared Duke, Wei Wei, Khanh LeViet.

Note: Core ML, Neural Engine and Bionic SoCs (A12, A13) are products of Apple Inc.

TensorFlow Lite ·

TensorFlow Lite Core ML delegate enables faster inference on iPhones and iPads

4월 02, 2020 — Posted by Tei Jeong and Karim Nosseir, Software Engineers

TensorFlow Lite offers options to delegate part of the model inference, or the entire model inference, to accelerators, such as the GPU, DSP, and/or NPU for efficient mobile inference. On Android, you can choose from several delegates: NNAPI, GPU, and the recently added Hexagon delegate. Previously, with Apple's mobile devices — iPhone…