Boost inference speeds with NVIDIA TensorRT on UbiOps

Improving the efficiency and runtime speed of ML models has never been more relevant than now. More and more large, expensive machines are continuously running massive ML models, and long inference times are a major bottleneck for the total throughput of these systems. To keep up with the demand for faster processing, hardware is constantly being improved. When CPUs alone are no longer enough, GPUs come into play. GPUs can compute certain workloads much faster than any CPU could, but can we optimize our programs to utilize all this power?

This article will go into detail about TensorRT from NVIDIA and how to use it on the UbiOps infrastructure. TensorRT is an SDK that optimizes ML models for NVIDIA GPUs to deliver low latency and high throughput for inference applications!

The use of GPUs for ML computing

Before diving straight into how to make efficient use of GPUs, let’s take a step back and look at why using a GPU is beneficial, sometimes even necessary, for many different applications, especially for machine learning.

CPUs are fast for general-purpose, mostly sequential workloads, but they do not handle large, highly parallelizable workloads very well. GPUs, however, are a great fit for this case! They have many more cores than CPUs, which makes them very capable of handling parallel workloads, with ML models being one of the prime examples!

Luckily, NVIDIA has been a leader in the use of GPUs for AI and machine learning. They have created many popular platforms and SDKs that support the use of NVIDIA GPUs for parallel workloads, e.g. CUDA and TensorRT. Simply explained, CUDA is a parallel computing platform and programming model that helps developers use NVIDIA GPUs to speed up different workloads. TensorRT will be discussed in the following section.

What exactly is TensorRT?

Now to answer one of the bigger questions, what exactly is TensorRT? NVIDIA TensorRT is an SDK for high-performance deep learning inference built on top of CUDA. It is able to optimize ML-models on many different levels, from model data type quantization to GPU memory optimizations.

These optimization techniques can, however, come at a cost: reduced accuracy, for example when the model is run at a lower numerical precision. For many applications the performance increase hugely outweighs the loss in accuracy, but this might not be the case for some specific models. Always test to see if the accuracy drop is acceptable.
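
To get a feel for where such an accuracy loss can come from, the sketch below rounds a handful of made-up weights to 16-bit floats, one of the reduced-precision formats TensorRT can use, and compares a dot product against the full-precision result. It only uses Python's standard library; the weights, inputs and helper names are illustrative assumptions, and the code does not call TensorRT itself.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', value))[0]


def dot(xs, ws):
    """Plain dot product of two equally long sequences."""
    return sum(x * w for x, w in zip(xs, ws))


# Hypothetical weights and inputs, chosen only for illustration.
weights = [0.1234567, -0.7654321, 0.3333333, 0.0001234]
inputs = [1.5, 2.25, -0.75, 100.0]

full = dot(inputs, weights)
reduced = dot([to_fp16(x) for x in inputs], [to_fp16(w) for w in weights])

print(f'full precision : {full:.7f}')
print(f'half precision : {reduced:.7f}')
print(f'absolute error : {abs(full - reduced):.7f}')
```

The error per operation is tiny, but a deep network chains millions of such operations, which is why the final predictions should always be checked against the unoptimized model.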

More detailed information can be found on the NVIDIA TensorRT website.

[Image: TensorRT optimizer. Image by NVIDIA, from the NVIDIA TensorRT SDK website page]

Why is UbiOps perfect for using TensorRT?

Now to answer the biggest question, why should you use UbiOps together with TensorRT?

UbiOps is a platform that makes it easy to deploy and scale your Python/R programs on different hardware! For example, you can choose between different GPUs, ranging from the efficient and powerful NVIDIA T4 all the way to the absolute powerhouse that is the NVIDIA A100, depending on your use case. UbiOps is furthermore an on-demand service; you pay for what you use. This comes in handy especially when your program needs expensive GPUs. By using TensorRT together with UbiOps, you’ll get results quicker thanks to the performance increase from TensorRT, and you’ll also pay less thanks to the decreased execution time of your program!


In order to test the start-up time and the inference time for different GPUs and different inference techniques (CPU, GPU and GPU with TensorRT), we have written a test script. The script measures the performance impact of TensorRT on a UbiOps deployment using either an NVIDIA T4 or an NVIDIA A100 GPU.

The model used for this small benchmark is the ResNet152 ONNX model, pretrained on the ImageNet dataset. This pretrained model classifies images into 1000 different classes.
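
The benchmark script itself is not reproduced in this article. As a rough sketch of how such a measurement could be set up, the snippet below times the one-off start-up (loading and optimizing the model) separately from the repeated inference calls. `load_model` and `predict` are hypothetical stand-ins that only sleep; in a real test they would create the ONNX Runtime session and run it.

```python
import time
from statistics import mean


def load_model():
    """Stand-in for the one-off model load (hypothetical: just sleeps)."""
    time.sleep(0.05)
    return object()


def predict(model, sample):
    """Stand-in for a single inference call (hypothetical: just sleeps)."""
    time.sleep(0.005)
    return sample


def benchmark(n_runs=10):
    """Return (start-up time, mean inference time) in seconds."""
    start = time.perf_counter()
    model = load_model()
    startup_time = time.perf_counter() - start

    inference_times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(model, sample=None)
        inference_times.append(time.perf_counter() - start)

    return startup_time, mean(inference_times)


startup, inference = benchmark()
print(f'start-up: {startup:.3f} s, mean inference: {inference:.4f} s')
```

Keeping the two measurements apart matters because the start-up cost is paid once per cold start, while the inference cost is paid on every request.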

The results can be seen in the following figures:

[Figures: benchmark results for the start-up time and the inference time]

As can be seen in the figures above, using TensorRT results in the shortest inference time (while still classifying correctly!), followed by using a GPU with CUDA, with CPU-only inference coming in last by far. Optimizing the model with TensorRT therefore gives a particularly big performance increase compared to just using CUDA with an NVIDIA GPU. Furthermore, the GPU used makes a big difference, with the more powerful A100 performing more than twice as fast as the T4.

The start-up time, however, shows almost the exact opposite pattern. Start-up takes noticeably longer with a GPU than with a CPU, and longer still when TensorRT is used. Initializing the GPU takes some time compared to using a CPU instance, and TensorRT additionally has to optimize the model for inference on the GPU. This start-up time is only noticeable when the instance running the model needs a cold start. If the instance/model is already up and running, only the inference time is noticed.
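
Whether the longer start-up is worth it depends on how many requests an instance serves after a cold start. As a back-of-the-envelope sketch, with made-up numbers rather than the measured results: the extra start-up time of TensorRT is paid back once the time saved per request, multiplied by the number of requests, exceeds it. The function name and all timings below are assumptions for illustration.

```python
import math


def break_even_requests(extra_startup_s, saving_per_request_s):
    """Number of requests after which the extra start-up time is recovered."""
    if saving_per_request_s <= 0:
        raise ValueError('the optimized model must be faster per request')
    return math.ceil(extra_startup_s / saving_per_request_s)


# Hypothetical timings in seconds, NOT taken from the benchmark figures.
startup_cuda, startup_trt = 10.0, 70.0
inference_cuda, inference_trt = 0.020, 0.008

n = break_even_requests(startup_trt - startup_cuda,
                        inference_cuda - inference_trt)
print(f'TensorRT pays off after about {n} requests per cold start')
```

For deployments that stay warm and handle many requests, the start-up cost quickly becomes negligible; for rarely used deployments that cold-start often, it can dominate.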

How to use TensorRT on UbiOps

Now that the pros and cons of using TensorRT have been discussed with some benchmarks, it is time to show how to use TensorRT on UbiOps. An extensive tutorial is already available here.

It is recommended to follow that tutorial to get an in-depth understanding of what exactly is needed to run TensorRT on UbiOps. A short list of steps with some information is provided here to give a general idea of how everything can be set up!

1. Create a file with your TensorRT code!
The following code block is an example that uses ONNX Runtime and the ResNet152-v2 ONNX model to utilize TensorRT!

import os
import time
import urllib.request

import numpy as np
import onnxruntime as rt
from PIL import Image

# Placeholders: fill in the download locations of the ResNet152-v2 ONNX
# model and the ImageNet labels file (synset.txt).
MODEL_URL = '<model download URL>'
LABELS_URL = '<labels download URL>'


class Deployment:

    def __init__(self, base_directory, context):
        # Check if the model exists
        if not os.path.exists('resnet152-v2-7.onnx'):
            # Download the model
            print('Downloading model...')
            urllib.request.urlretrieve(MODEL_URL, 'resnet152-v2-7.onnx')
            print('Model downloaded')

        # Check if the labels file exists
        if not os.path.exists('synset.txt'):
            # Download the labels file
            print('Downloading labels...')
            urllib.request.urlretrieve(LABELS_URL, 'synset.txt')
            print('Labels downloaded')

        # Load the model and set available providers - TensorRT, CUDA, CPU.
        # ONNX Runtime tries the providers in the order given.
        self.sess = rt.InferenceSession(
            'resnet152-v2-7.onnx',
            providers=[
                'TensorrtExecutionProvider',
                'CUDAExecutionProvider',
                'CPUExecutionProvider',
            ],
        )

        # Define input and output names (the model has one input and one output)
        self.input_name = self.sess.get_inputs()[0].name
        self.output_name = self.sess.get_outputs()[0].name

        # Load the labels file
        with open('synset.txt', 'r') as f:
            self.labels = [line.strip() for line in f.readlines()]

    def request(self, data):
        # Open the image using PIL
        # (assumes the deployment has a file input field named 'image')
        img = Image.open(data['image'])

        # Resize the image to the input size expected by the model
        img = img.resize((224, 224))

        # Convert the image to a numpy array
        img = np.asarray(img)

        # Convert the image to the format expected by the model (RGB, float32)
        img = img[:, :, :3]  # remove alpha channel if present
        img = img.transpose((2, 0, 1))  # change from HWC to CHW format
        img = img.astype(np.float32) / 255.0  # normalize pixel values to [0, 1]

        # Normalize each channel separately using
        # mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
        img[0] = (img[0] - 0.485) / 0.229
        img[1] = (img[1] - 0.456) / 0.224
        img[2] = (img[2] - 0.406) / 0.225
        img = np.expand_dims(img, axis=0)  # add batch dimension

        # Run inference and time it
        start = time.time()
        output = self.sess.run([self.output_name], {self.input_name: img})[0].flatten()
        end = time.time()
        inference_time = end - start

        # Get the top 5 predicted classes and their scores
        top_idx = np.argsort(output)[::-1][:5]
        top_prob = output[top_idx]

        # Create the predictions
        predictions = [f'{self.labels[top_idx[i]]}: {top_prob[i]:.3f}' for i in range(5)]

        # Return the predictions and the inference time
        return {
            'predictions': predictions,
            'time': inference_time
        }
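
ResNet models exported to ONNX often return raw scores rather than probabilities. If actual probabilities are needed, a softmax can be applied before picking the top classes. The sketch below shows this post-processing step in plain Python on made-up scores and labels; `softmax` and `top_k` are illustrative helpers, not part of ONNX Runtime or UbiOps.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probabilities, labels, k=5):
    """Return the k most likely (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up scores and labels, for illustration only.
scores = [2.0, 0.5, 4.0, -1.0]
labels = ['cat', 'dog', 'bird', 'fish']

for label, prob in top_k(softmax(scores), labels, k=3):
    print(f'{label}: {prob:.3f}')
```

Softmax does not change the ranking of the classes, so the top 5 stays the same; only the reported values become easier to interpret.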


2. Specify the Python packages to install in the requirements.txt file!
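For the preceding example code, the following packages should be installed. The list below is derived from the imports in the example and the tensorrt folder used in step 3; version pins are left out, so choose versions that match the CUDA version of your deployment:

numpy
onnxruntime-gpu
Pillow
tensorrt
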

3. Tell Ubuntu where TensorRT is installed by specifying the installation folder in the ubiops.yaml file

environment_variables:
 - LD_LIBRARY_PATH=/var/deployment_instance/venv/lib/python3.10/site-packages/tensorrt/:${LD_LIBRARY_PATH}

4. Create a deployment with GPU access and CUDA installed
GPU access is necessary!

[Image: GPU access in UbiOps]
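
ONNX Runtime tries the execution providers in the order in which they are passed, so it can help to check which ones are actually present in the deployment. The helper below is a small sketch of that check in plain Python; `choose_providers` is a hypothetical name, and the `available` list is hard-coded here, whereas in a deployment it would come from `onnxruntime.get_available_providers()`.

```python
PREFERRED = [
    'TensorrtExecutionProvider',
    'CUDAExecutionProvider',
    'CPUExecutionProvider',
]


def choose_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are available, in preference order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError('none of the preferred execution providers is available')
    return chosen


# Example: an instance without GPU access only offers the CPU provider.
available = ['CPUExecutionProvider']
print(choose_providers(available))  # prints ['CPUExecutionProvider']
```

If the result contains only the CPU provider, the deployment is most likely running without GPU access or without the TensorRT and CUDA libraries on the library path.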

5. Package and upload the code!
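
Packaging comes down to putting the files from the previous steps into one zip archive. A minimal sketch using only the standard library is shown below; the folder name `deployment_package`, the file name `deployment.py` and the helper `package_deployment` are assumptions, so check the tutorial for the exact layout UbiOps expects. The demo uses empty placeholder files in a temporary folder.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def package_deployment(source_dir, archive_base):
    """Zip source_dir (including the folder itself) and return the archive path."""
    source = Path(source_dir).resolve()
    archive = shutil.make_archive(
        str(archive_base), 'zip',
        root_dir=str(source.parent), base_dir=source.name,
    )
    return Path(archive)


# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / 'deployment_package'
    package_dir.mkdir()
    for name in ('deployment.py', 'requirements.txt', 'ubiops.yaml'):
        (package_dir / name).touch()

    archive = package_deployment(package_dir, Path(tmp) / 'deployment_package')
    with zipfile.ZipFile(archive) as zf:
        print(sorted(zf.namelist()))
```

The resulting zip file can then be uploaded as a new version of the deployment.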

6. Use the model!

[Image: results in UbiOps]

In summary

Using GPUs alongside CPUs for ML model inference is a great step to take if speed and performance are crucial. Using TensorRT on top of this to optimize the use of GPUs for inference will result in performance improvements one could only dream of when using just a CPU.

To help in the process of getting your model up and running easily, UbiOps could be the tool for you!

As can be seen in the tutorial above, most of the work needed for running your code is already done by UbiOps! Implementing TensorRT with UbiOps lets you focus on creating your TensorRT model instead of spending countless hours trying to deploy the model and get everything working correctly.

Ready to save countless hours and focus on actually creating your own models instead of implementing them? Don’t hesitate to contact us and see first-hand how UbiOps can streamline your ML model deployment!

Get in touch with our experts to see what we can do for you, or create a free account to try it for yourself!
