Make your model faster on CPU using ONNX

How to speed up a TensorFlow model by 200%?

Machine learning models nowadays require more and more compute power. According to a study from OpenAI, the compute power used to train AI models has been rising ever since the field's early days in the 1960s. Up until 2012, the required compute roughly doubled every two years; since 2012, it has doubled every 3 to 4 months!

Not only does this increase in compute power add costs, it also has a very negative influence on the environment. The carbon footprint of the training phase of some AI models is equivalent to that of five cars over their entire lifetimes, combined! If you want to reduce the compute power an AI model requires, you can do two things:

  • Make the AI model smaller, which is of course not the preferred option
  • Make the model more efficient

The latter option is what this article focuses on. You can use ONNX to make a TensorFlow model roughly 200% faster, which eliminates the need for a GPU and lets you run inference on a CPU instead. Using a CPU instead of a GPU has several other benefits as well:

  • CPUs are more broadly available and cheaper to use
  • CPUs can support larger memory capacities than even the best GPUs, which helps with memory-hungry workloads such as 2D image detection

All steps in this article are extensively documented in a Jupyter notebook.

What are ONNX and the ONNX runtime?

Many neural networks are developed using the popular library TensorFlow. However, as the title suggests, the speed-up will come from using ONNX. But what exactly is ONNX? ONNX stands for “Open Neural Network Exchange“ and is essentially an open representation format for machine learning algorithms. It allows for portability; in other words, an ONNX model can run everywhere. You can simply import and export ONNX models in popular tools like PyTorch and TensorFlow.

This is great on its own, but the added benefit is that you can choose the runtime. This means you can optimize the model for your purposes based on its needs and the way it is run. We are not going into complicated optimizations in this article; we are going to make one very simple change. We will swap the standard TensorFlow library, with all of its unnecessary bloat, for something more optimized: the ONNX runtime, an optimized runtime for ONNX models that is easily used from Python. Is TensorFlow really bloated? Well, it can be for pure inference purposes, which is exactly where the ONNX runtime excels, because inference is all it does.

Preparing the TensorFlow model

It is quite easy to convert a network in TensorFlow's `SavedModel` format to ONNX. You can use the handy Python package tf2onnx to do this; it does all the hard work for you. As long as you do not have a very exotic neural network, the following line will probably work:


python3 -m tf2onnx.convert --saved-model model --opset 13 --output model.onnx
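
If you prefer to stay in Python rather than call the converter from the command line, tf2onnx also exposes a conversion API. The snippet below is a minimal sketch, assuming your Keras model lives in the `model` directory and you are on a recent tf2onnx release; some models additionally need an explicit input signature:

import tf2onnx
import tensorflow as tf
# Load the Keras model from the SavedModel directory used above
model = tf.keras.models.load_model("model")
# Convert with the same opset as the command-line example and write model.onnx
onnx_model, _ = tf2onnx.convert.from_keras(model, opset=13, output_path="model.onnx")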

Switching runtimes

Assuming you are running something similar to this for inference using TensorFlow:

from tensorflow.keras.models import load_model
model = load_model("model")
out = model.predict(x)

We now have to use the ONNX runtime with our converted network instead:

import onnxruntime as rt
sess = rt.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
out = sess.run(None, {input_name: x})[0]

It does not get simpler than this. The only real difference is syntax-related. You might also notice that the ONNX runtime is a bit more sensitive to input names, but these are stored in the ONNX format, so we can easily look them up with the `get_inputs()` method.
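
If you are not sure what the converted model expects, the session object can tell you. A small sketch, reusing the `sess` from the snippet above:

# List the model's inputs and outputs as stored in the ONNX file
for inp in sess.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for outp in sess.get_outputs():
    print("output:", outp.name, outp.shape, outp.type)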


Results

Now for some proof that this actually works. The easiest way would be to simply run the two scripts a few times with a stopwatch and see which one takes longer. I did something a bit more accurate: I made two similar deployments on our deployment platform UbiOps. UbiOps is a platform that allows anyone to quickly run a piece of R or Python code in a professional production environment.

I then sent 100 requests to both deployments and looked at the average time spent on computing one request. Here are the results:

Figure 1: Average compute time per request, example 1

Figure 2: Average compute time per request, example 2

Model performance

As you can see, this roughly doubles the performance of the model with minimal effort. Try it for yourself by downloading the Jupyter notebook with all the steps.
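
If you first want a quick local sanity check before setting up any deployments, the stopwatch approach mentioned earlier is easy to script. The following is a rough sketch, assuming the `model` directory, the converted `model.onnx`, and a dummy input shape that you should replace with whatever your own model expects:

import time
import numpy as np
import onnxruntime as rt
from tensorflow.keras.models import load_model
# Dummy input; adjust the shape and dtype to your own model
x = np.random.rand(1, 224, 224, 3).astype(np.float32)
tf_model = load_model("model")
sess = rt.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
def average_time(fn, runs=100):
    fn()  # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs
print("TensorFlow  :", average_time(lambda: tf_model.predict(x)), "s per prediction")
print("ONNX runtime:", average_time(lambda: sess.run(None, {input_name: x})), "s per prediction")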

If you want to know more about UbiOps, take a look at our product page and our tutorials (a repository of example Jupyter notebooks).
