https://blog.tensorflow.org/2018/04/speed-up-tensorflow-inference-on-gpus-tensorRT.html


April 18, 2018 —
*Posted by Siddharth Sharma, Technical Product Marketing Manager, NVIDIA;
Sami Kama, Deep Learning Developer Technologist, NVIDIA;
Julie Bernauer, Pursuit Engineering Solution Architect, NVIDIA;
Laurence Moroney, Developer Advocate, Google*

Overview: TensorFlow remains the most popular deep learning framework today, with tens of thousands of users worldwide. NVIDIA® TensorRT™ is a deep learning pl…

Speed up TensorFlow Inference on GPUs with TensorRT


Figure 1. TensorRT optimizes trained neural network models to produce deployment-ready runtime inference engines.

Figure 2. (a) An example convolutional neural network with multiple convolutional and activation layers. (b) TensorRT’s vertical and horizontal layer fusion and layer elimination optimizations simplify the GoogLeNet Inception module graph, reducing computation and memory overhead.

TensorRT sped up TensorFlow inference by 8x for low-latency runs of the ResNet-50 benchmark. These performance improvements cost only a few lines of additional code and work with the TensorFlow 1.7 release and later. In this article we describe the new workflow and APIs to help you get started.

Figure 3. Workflow diagram when using TensorRT within TensorFlow.

Let’s look at an example of a graph with three segments, A, B, and C. TensorRT optimizes Segment B, then replaces it with a single node. During inference, TensorFlow executes A, calls TensorRT to execute B, and then TensorFlow executes C. From a user’s perspective, you continue to work in TensorFlow as before.

TensorRT optimizes the largest sub-graphs possible in the TensorFlow graph. The more compute in a sub-graph, the greater the benefit obtained from TensorRT. For best performance, you want most of the graph optimized and replaced with the fewest possible TensorRT nodes. Depending on the operations in your graph, the final graph might contain more than one TensorRT node. With the TensorFlow API, you can specify the minimum number of nodes a sub-graph must contain for it to be converted to a TensorRT node. Any sub-graph with fewer nodes than this minimum will not be converted to a TensorRT engine, even if it is compatible with TensorRT. This can be useful for models containing small compatible sub-graphs separated by incompatible nodes, which would otherwise lead to tiny TensorRT engines.
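The segment-size rule can be illustrated with a toy partitioner. This is only a sketch under simplifying assumptions: the graph is treated as a linear list of op types rather than a real TensorFlow graph, and `SUPPORTED_OPS` and `partition` are hypothetical names invented for this illustration, not part of the TensorFlow or TensorRT APIs.

```python
from itertools import groupby

# Hypothetical set of op types assumed to be TensorRT-compatible (illustration only).
SUPPORTED_OPS = {"Conv2D", "BiasAdd", "Relu", "MaxPool"}


def partition(ops, minimum_segment_size=3):
    """Group a linear list of op types into TensorFlow and TensorRT segments.

    A run of consecutive supported ops becomes a TensorRT segment only if it
    has at least `minimum_segment_size` nodes; otherwise it stays in TensorFlow.
    """
    segments = []
    for supported, run in groupby(ops, key=lambda op: op in SUPPORTED_OPS):
        run = list(run)
        big_enough = len(run) >= minimum_segment_size
        engine = "TensorRT" if supported and big_enough else "TensorFlow"
        # Merge adjacent TensorFlow runs so the result reads as segments A, B, C.
        if segments and segments[-1][0] == engine == "TensorFlow":
            segments[-1][1].extend(run)
        else:
            segments.append((engine, run))
    return segments


ops = ["Placeholder", "Conv2D", "BiasAdd", "Relu", "MaxPool",
       "Softmax", "Relu", "ArgMax"]
segments = partition(ops, minimum_segment_size=3)
for engine, run in segments:
    print(engine, run)
```

With a minimum of 3, the four-op convolution run becomes the single TensorRT segment, while the lone `Relu` after `Softmax` is compatible but too small and stays in TensorFlow. Raising the minimum to 5 leaves the whole toy graph in TensorFlow.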

Let’s look at how to implement the workflow in more detail.

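Use the `per_process_gpu_memory_fraction` parameter of the `GPUOptions` function to set the fraction of GPU memory that TensorFlow may use, leaving the rest for TensorRT engines. This parameter needs to be set the first time the TensorFlow-TensorRT process starts. For example, setting `per_process_gpu_memory_fraction` to 0.67 allocates 67% of GPU memory for TensorFlow and the remaining third for TensorRT engines.

```
import tensorflow as tf
# The fraction reserved for TensorFlow must satisfy 0 < fraction < 1.
memory_for_tensorflow = 0.67
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=memory_for_tensorflow)
```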

The next step is letting TensorRT analyze the TensorFlow graph, apply optimizations, and replace subgraphs with TensorRT nodes. You apply TensorRT optimizations to the frozen graph with the new `create_inference_graph` function. This function takes a frozen TensorFlow graph as input, then returns an optimized graph with TensorRT nodes, as shown in the following code snippet:

```
import tensorflow.contrib.tensorrt as trt

# Replace TensorRT-compatible subgraphs of the frozen graph with TensorRT nodes.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph_def,
    outputs=output_node_name,
    max_batch_size=batch_size,
    max_workspace_size_bytes=workspace_size,
    precision_mode=precision,
    minimum_segment_size=3)
```

Let’s look at the function’s parameters:

- `input_graph_def`: frozen TensorFlow graph
- `outputs`: list of strings with names of output nodes, e.g. `["resnet_v1_50/predictions/Reshape_1"]`
- `max_batch_size`: integer, size of input batch, e.g. 16
- `max_workspace_size_bytes`: integer, maximum GPU memory size available for TensorRT
- `precision_mode`: string, allowed values “FP32”, “FP16” or “INT8”
- `minimum_segment_size`: integer (default = 3), controls the minimum number of nodes in a sub-graph for a TensorRT engine to be created

The `per_process_gpu_memory_fraction` and `max_workspace_size_bytes` parameters should be used together to split the available GPU memory between TensorFlow and TensorRT for the best overall application performance. To maximize inference performance, you might want to give TensorRT slightly more memory than it needs, giving TensorFlow the rest. For example, setting the `per_process_gpu_memory_fraction` parameter to (12 - 4) / 12 = 0.67 leaves the remaining third of GPU memory for TensorRT.

Figure 4. (a) ResNet-50 graph in TensorBoard. (b) ResNet-50 after TensorRT optimizations have been applied and the sub-graph replaced with a TensorRT node.
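The memory split between the two parameters comes down to simple arithmetic, sketched below. The helper `split_gpu_memory` is a hypothetical name for this illustration. The sketch assumes that the 12 and 4 in the expression (12 - 4) / 12 are gigabytes of total GPU memory and of memory set aside for TensorRT (the text does not state units), and that the TensorRT budget is passed directly as `max_workspace_size_bytes`.

```python
def split_gpu_memory(total_bytes, tensorrt_bytes):
    """Return (per_process_gpu_memory_fraction, max_workspace_size_bytes).

    TensorFlow gets whatever is left after reserving `tensorrt_bytes`
    for TensorRT engines.
    """
    if not 0 < tensorrt_bytes < total_bytes:
        raise ValueError(
            "TensorRT budget must be positive and smaller than total GPU memory")
    tensorflow_fraction = (total_bytes - tensorrt_bytes) / total_bytes
    return tensorflow_fraction, tensorrt_bytes


GB = 10**9  # assumption: decimal gigabytes

fraction, workspace_size = split_gpu_memory(total_bytes=12 * GB,
                                            tensorrt_bytes=4 * GB)
print(round(fraction, 2), workspace_size)
```

The two returned values would then be passed as `per_process_gpu_memory_fraction` and `max_workspace_size_bytes` respectively. In practice the right budget depends on the model and is best found by measurement.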

Figure 5. Matrix processing operations on Tensor Cores.

Set the `precision_mode` parameter in the `create_inference_graph` function to enable half precision, as shown below. `getNetwork()` is a helper function that reads the frozen network from the protobuf file and returns a `tf.GraphDef()` of the network.

```
# Request half-precision (FP16) TensorRT engines.
trt_graph = trt.create_inference_graph(
    getNetwork(network_file_name),
    outputs,
    max_batch_size=batch_size,
    max_workspace_size_bytes=workspace_size,
    precision_mode="FP16")
```

Figure 6 shows ResNet-50 performing 8 times faster under 7 ms latency with the TensorFlow-TensorRT integration using NVIDIA Volta Tensor Cores versus running TensorFlow only on the same hardware.

Figure 6. ResNet-50 inference throughput performance.

Figure 7. Workflow incorporating INT8 inference.

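Use the `create_inference_graph` function with the `precision_mode` parameter set to “INT8” to build the calibration graph:

```
# In INT8 mode the returned graph is a calibration graph, not yet the final inference graph.
calibGraph = trt.create_inference_graph(
    getNetwork(network_file_name),
    outputs,
    max_batch_size=batch_size,
    max_workspace_size_bytes=workspace_size,
    precision_mode="INT8")
```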

Now run the calibration graph with calibration data. TensorRT uses the distribution of node data to quantize the weights for the nodes. It’s imperative that you use calibration data closely reflecting the distribution of the problem dataset in production. We suggest checking for error accumulation during inference when first using models calibrated with INT8. The *minimum_segment_size* parameter can help tune the optimized graph to minimize quantization errors: by changing the minimum number of nodes in the optimized INT8 engines, you change the final optimized graph and can fine-tune result accuracy.

After executing the graph on calibration data, apply TensorRT optimizations to the calibration graph with the `calib_graph_to_infer_graph` function. This function also replaces the TensorFlow subgraph with a TensorRT node optimized for INT8. The output of the function is a frozen TensorFlow graph that can be used for inference as usual.

`trt_graph = trt.calib_graph_to_infer_graph(calibGraph)`
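To see why representative calibration data matters, consider the toy experiment below. It is a deliberately simplified sketch: it uses symmetric max-abs scaling on a list of floats, which is not the calibration algorithm TensorRT implements, and every function name in it is hypothetical. It only shows how a scale chosen from unrepresentative data inflates quantization error on production-like values.

```python
import random


def calibrate_scale(calibration_values):
    """Pick a symmetric INT8 scale from the largest magnitude seen in calibration."""
    return max(abs(v) for v in calibration_values) / 127.0


def quantize(value, scale):
    """Map a float to an INT8 code, saturating at the representable range."""
    return max(-127, min(127, round(value / scale)))


def mean_abs_error(values, scale):
    """Average round-trip error after quantizing and dequantizing."""
    return sum(abs(v - quantize(v, scale) * scale) for v in values) / len(values)


random.seed(0)
# Stand-in for values seen in production (assumed standard normal).
production = [random.gauss(0.0, 1.0) for _ in range(10_000)]
# Calibration set drawn from the same distribution as production.
representative = [random.gauss(0.0, 1.0) for _ in range(1_000)]
# Calibration set with a much narrower spread than production.
too_narrow = [random.gauss(0.0, 0.1) for _ in range(1_000)]

good_error = mean_abs_error(production, calibrate_scale(representative))
bad_error = mean_abs_error(production, calibrate_scale(too_narrow))
print(f"representative calibration: {good_error:.4f}")
print(f"mismatched calibration:     {bad_error:.4f}")
```

With the narrow calibration set, most production values fall outside the calibrated range and saturate, so the mismatched error comes out much larger than the representative one. The same effect is the reason to check for error accumulation on real data after INT8 calibration.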

These two commands are all it takes to enable INT8 precision inference with your TensorFlow model.

The code required to run the examples shown here is available at https://developer.download.nvidia.com/devblogs/tftrt_sample.tar.xz

Find instructions on how to get started today at: https://www.tensorflow.org/install/install_linux

In the near future, we expect the standard pip install process to work as well. Stay tuned!

We believe you’ll see substantial benefits to integrating TensorRT with TensorFlow when using GPUs. You can find more information on TensorFlow at https://www.tensorflow.org/.

Additional information on TensorRT can be found on NVIDIA’s TensorRT page at https://developer.nvidia.com/tensorrt.
