I previously wrote an article on how you can improve neural network inference performance by switching from TensorFlow to ONNX Runtime. But now UbiOps also supports GPU inference. We all know GPUs can improve performance a lot, but how do you get your ONNX model running on a GPU? And should you run all of your neural networks on GPUs because they are faster? In this article I will look into exactly that.

For this article I used a model that can upscale images. It is available in the ONNX Model Zoo, a place where you can get pretrained models in ONNX format. The model is already pretty fast, but I have found that running it on a GPU can improve performance by a factor of two.

Note that GPUs for inference are not available on the free version of UbiOps. You can also test this on your local machine, as long as you have a GPU and a working CUDA (Compute Unified Device Architecture) 11 installation, but installing CUDA is out of the scope of this article; you will have to look up the installation instructions yourself. If you do have CUDA installed, though, you can use it to test the deployments locally.
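If you want to test locally, a quick way to check whether ONNX Runtime can actually see your GPU is to look at its available execution providers. Here is a minimal sketch; the `pick_providers` helper and the model filename are my own illustrations, not code from the notebook:

```python
def pick_providers(available):
    """Prefer the CUDA provider when present, otherwise fall back to CPU."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# With onnxruntime installed (the onnxruntime-gpu package for CUDA support),
# you would use it like this:
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("super_resolution.onnx", providers=providers)
# On a machine with a working CUDA 11 install, the list starts with
# "CUDAExecutionProvider"; otherwise only the CPU provider remains.
print(pick_providers(["CPUExecutionProvider"]))
```

Keeping the CPU provider as a fallback means the same code runs on machines with and without a GPU.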


Getting the models on UbiOps

I made a notebook and an importable zip so that you can reproduce my experiment exactly. If you run the cells under the header “Getting the models on UbiOps”, everything will be set up for you (the notebook explains itself).

But I also want to quickly explain how to do this manually. Say you have a UbiOps deployment that uses ONNX and you now want to accelerate it with a GPU. The only thing you have to do is create a new version and select a language that contains a version of CUDA.


When selecting the CUDA version, make sure that you match it to what your ONNX Runtime version expects. You can find a compatibility table on this page. For example, in this experiment we will use ONNX Runtime 1.10, which requires CUDA 11.


Comparing ONNX performance CPU vs GPU

Now that we have two deployments ready to go, we can start to look at the performance difference. In the Jupyter notebook you will also find a section about benchmarking. We use a dataset called Imagenette, sample 100 images from it, and send them in a batch to both deployments. The deployment code contains a small part that prints the average time it takes to do the inference, so the only thing we have to do is look at the logs of both deployments after all requests are done.
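The timing inside the deployment boils down to averaging the inference latency over the sampled images. A minimal sketch of that kind of measurement (the function names are illustrative, not the actual deployment code):

```python
import time

def average_latency(infer, samples, warmup=3):
    """Return the mean per-sample inference time in seconds."""
    for x in samples[:warmup]:  # warm-up runs are excluded; the first
        infer(x)                # calls often pay one-off setup costs
    start = time.perf_counter()
    for x in samples:
        infer(x)
    return (time.perf_counter() - start) / len(samples)

# With an ONNX Runtime session you would time something like:
#   average_latency(lambda img: session.run(None, {"input": img}), images)
```

The warm-up runs matter especially on GPU, where the first call can include one-off initialization that would otherwise skew the average.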

Here are the results:

It took the CPU version almost 0.06 seconds, while the GPU version took almost 0.03 seconds to complete. That’s a speed-up of a factor of 2! We have to keep in mind, though, that the CPU version was already very fast.

And if we compare this to the total request duration, which also includes file download/upload and other overhead, we see that the inference is only a fraction of the total duration.

Average request time (s)

CPU: 0.476

GPU: 0.494
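To put these numbers in perspective, we can work out the speed-up and the share of each request actually spent on inference, using the rounded timings above:

```python
cpu_infer, gpu_infer = 0.06, 0.03    # average inference time (s)
cpu_total, gpu_total = 0.476, 0.494  # average total request time (s)

speedup = cpu_infer / gpu_infer    # 2.0x faster inference on GPU
cpu_share = cpu_infer / cpu_total  # inference is ~13% of a CPU request
gpu_share = gpu_infer / gpu_total  # and only ~6% of a GPU request

print(f"speed-up: {speedup:.1f}x")
print(f"inference share: CPU {cpu_share:.0%}, GPU {gpu_share:.0%}")
```

So even though inference itself is twice as fast, the roughly 0.03 seconds saved barely move the total request time, which is dominated by overhead.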



So a GPU speeds up the inference, great right? Well, it is actually a bit more complicated than that. In this case the speed-up is hardly noticeable, because the model was already very fast on the CPU. So the question is: given that GPU instances are more expensive to run, is GPU inference really worth it in this case? I would say probably not.

However, that’s exactly why I like this experiment. Yes, there are big performance gains to be had in inference, as this experiment clearly shows, and we have even seen gains of up to a factor of 60. But we are not doing inference in a vacuum: as the experiment showed, the inference was already a tiny part of the request, and making it even smaller is not necessarily a huge benefit.


Run your neural network on GPUs

So should you run all your neural networks on GPUs using ONNX? The answer is, as it often is: it depends. You have to put the inference performance in the perspective of your whole application. What performance gains are you getting? What kind of performance do you actually need? How much extra will the performance gain cost you?

Based on this you should make your own decision. At least ONNX makes it easy to quickly test both GPU and CPU inference.