Load-testing TensorFlow Serving’s REST Interface
јул 28, 2022

Posted by Chansung Park and Sayak Paul (ML-GDEs)

In this post, we’ll share the lessons and findings learned from conducting load tests for an image classification model across numerous deployment configurations. These configurations involve REST-based deployments with TensorFlow Serving. In this way, we aim to equip the readers with a holistic understanding of the differences between the configurations.

This post is less about code and more about the architectural decisions we had to make for performing the deployments. We’ll first provide an overview of our setup including the technical specifications. We’ll also share our commentaries on the design choices we made and their impact.

Technical Setup

TensorFlow Serving is feature-rich and has targeted specifications embedded in its designs (more on this later). For online prediction scenarios, the model is usually exposed as some kind of service.

To perform our testing we use a pre-trained ResNet50 model which can classify a variety of images into different categories. We then serve this model in the following way:

Our deployment platform (nodes on the Kubernetes Cluster) is CPU-based. We don’t employ GPUs at any stage of our processes. For this purpose, we can build a CPU-optimized TensorFlow Serving image and take advantage of a few other options which can reduce the latency and boost the overall throughput of the system. We will discuss these later in the post.

You can find all the code and learn how the deployments were performed in this repository. Here, you’ll find example notebooks and detailed setup instructions for playing around with the code. As such, we won’t be discussing the code line by line but rather shed light on the most important parts when necessary.

Throughout the rest of this post, we’ll discuss the key considerations for the deployment experiments respective to TensorFlow Serving including its motivation, limitations, and our experimental results.

With the emergence of serverless offerings like Vertex AI, it has never been easier to deploy models and scale them securely and reliably. These services help reduce the time-to-market tremendously and increase overall developer productivity. That said, there might still be instances where you’d like more granular control over things. This is one of the reasons why we wanted to do these experiments in the first place.

Considerations

TensorFlow Serving has its own sets of constraints and design choices that can impact a deployment. In this section, we provide a concise overview of these considerations.

Deployment infrastructure: We chose GKE because Kubernetes is a standard deployment platform when using GCP, and GKE lets us focus on the ML parts without worrying about the infrastructure since it is a fully managed Google Cloud Platform service. Our main interest is in how to deploy models for CPU-based environments, so we have prepared a CPU-optimized TensorFlow Serving image.

Trade-off between more or fewer servers: We started experiments for TensorFlow Serving setups with the simplest possible VMs equipped with 2vCPU and 4GB RAM, then we gradually upgraded the specification up to 8vCPU and 64GB RAM. On the other hand, we decreased the number of nodes in the Kubernetes cluster from 8 to 2 because it is a trade-off between costs to deploy cheaper servers versus fewer expensive servers.

Options to benefit multi-core environments: We wanted to see if high-end VMs can outperform simple VMs with options to take advantage of the multi-core environment even though there are fewer nodes. To this end, we experimented with a different number inter_op_parallelism and intra_op_parallelism threads for TensorFlow Serving deployment set according to the number of CPU cores.

Dynamic batching and other considerations: Modern ML frameworks such as TensorFlow Serving usually support dynamic batching, initial model warm-up, multiple deployments of multiple versions of different models, and more out of the box. For our purpose of online prediction, we have not tested these features carefully. However, dynamic batching capability is also worth exploring to enhance the performance according to the official document. We have seen that the default batching configuration could reduce the latency a little even though the results of that are not included in this blog post.

Experiments

We have prepared the following environments. In TensorFlow Serving, the number of intra_op_parallelism_threads is set equal to the number of CPU cores while the number of inter_op_parallelism_threads is set from 2 to 8 for experimental purposes as it controls the number of threads to parallelize the execution of independent operations. Below we provide the details on the adjustments we performed on the number of vCPUs, RAM size, and the number of nodes for each Kubernetes cluster. Note that the number of vCPUs and the RAM size are applicable for the cluster nodes individually.

The load tests are conducted using Locust. We have run each load test for 5 minutes. The number of requests are controlled by the number of users, and it depends on the circumstances on the client side. We increased the number of users by one every second up to 150 which we found the handled number of requests reaches the plateau, and the requests are spawned every second to understand how TensorFlow Serving behaves. So you can assume that requests/second doesn't reflect the real-world situation where clients try to send requests at any time.

We experimented with the following node configurations on a Kubernetes cluster. The configurations are read like so: {num_vcpus_per_node}-{ram}_{num_nodes}:

  • 2vCPUs, 4GB RAM, 8 Nodes
  • 4vCPUs, 8GB RAM, 4 Nodes
  • 8vCPUs, 16GB RAM, 2 Nodes
  • 8vCPUs, 64GB RAM, 2 Nodes

You can find code for experimenting with these different configurations in the above-mentioned repositories. The deployment for each experiment is provisioned through Kustomize to overlay the base configurations, and file-based configurations are injected through ConfigMap.

Results

This section presents the results for each of the above configurations and suggests which configuration is the best based on the environments we considered. As per Figure 1, the best configuration and the environmental setup is observed as 2 nodes, 8 intra_op_parallelism_threads, 8 inter_op_parallelism_threads, 8vCPUs, 16GB RAM based on the result.

Figure 1: Comparison between different configurations of TensorFlow Serving (original).

We have observed the following aspects by picking the best options.

  • TensorFlow Serving is more efficient when deployed on fewer, larger (more CPU and RAM) machines, but the RAM capacity doesn’t have much impact on handling more requests. It is important to find the right number of inter_op_parallelism_threads with experimentation. With a higher number the better performance is not always guaranteed even when the nodes are equipped with high-capacity hardware.

TensorFlow Serving focuses more on reliability than throughput performance. We believe it sacrifices some throughput performance to achieve reliability, but this is the expected behavior of TensorFlow Serving, as stated in the official document. Even though handling as many requests as possible is important, keeping the server as reliable as possible is also substantially important when dealing with a production system.

There is a trade-off between performance and reliability, so you must be careful to choose the right one. However, it seems like the throughput performance of TensorFlow Serving is close enough to results from other frameworks such as FastAPI, and if you want to factor in richer features such as dynamic batching and sharing GPU resources efficiently between models, we believe TensorFlow Serving is the right one to choose.

Note on gRPC and TensorFlow Serving

We are dealing with an image classification model for the deployments, and the input to the model will include images. Hence the size of the request payload can spiral up depending on the image resolution and fidelity. Therefore it’s particularly important to ensure the message transmission is as lightweight as possible. Generally, message transmission is quite a bit faster in gRPC than REST. This post provides a good discussion on the main differences between REST and gRPC APIs.

TensorFlow Serving can serve a model with gRPC seamlessly, but comparing the performance of a gRPC API and REST API is non-trivial. This is why we did not include that in this post. The interested readers can check out this repository that follows a similar setup but uses a gRPC server instead.

Costs

We used the GCP cost estimator for this purpose. Pricing for each experiment configuration was assumed to be live for 24 hours per month (which was sufficient for our experiments).

Machine Configuration (E2 series)

Pricing (USD)

2vCPUs, 4GB RAM, 8 Nodes

11.15

4vCPUs, 8GB RAM, 4 Nodes

11.15

8vCPUs, 16GB RAM, 2 Nodes

11.15

8vCPUs, 64GB RAM, 2 Nodes

18.21

Conclusion

In this post, we discussed some critical lessons we learned from our experience of load-testing a standard image classification model. We considered the industry-grade framework for exposing the model to the end-users – TensorFlow Serving. While our setup for performing the load tests may not fully resemble what happens in the wild, we hope that our findings will at least act as a good starting point for the community. Even though the post demonstrated our approaches with an image classification model, the approaches should be fairly task-agnostic.

In the interest of brevity, we didn’t do much to push further the efficiency aspects of the model in both the APIs. With modern CPUs, software stack, and OS-level optimizations, it’s possible to improve the latency and throughput of the model. We redirect the interested reader to the following resources that might be relevant:

Acknowledgements

We are grateful to the ML Ecosystem team that provided GCP credits for supporting our experiments. We also thank Hannes Hapke and Robert Crowe for providing us with helpful feedback and guidance.

Next post
Load-testing TensorFlow Serving’s REST Interface

Posted by Chansung Park and Sayak Paul (ML-GDEs) In this post, we’ll share the lessons and findings learned from conducting load tests for an image classification model across numerous deployment configurations. These configurations involve REST-based deployments with TensorFlow Serving. In this way, we aim to equip the readers with a holistic understanding of the differences between the configur…