Using TensorRT in UbiOps

UbiOps provides the perfect infrastructure to run your machine learning models with NVIDIA TensorRT.

On this page, we will show how to run an inference job with TensorRT on UbiOps!

We will create a deployment where TensorRT is running, upload a model that uses TensorRT, and then show the speed differences between running this model with TensorRT, with CUDA, and on the CPU only. TensorRT and CUDA will be tested on two different GPUs: the NVIDIA T4 and the NVIDIA A100.

The model used will be the ResNet152 ONNX model, pretrained on the ImageNet dataset. This pretrained model classifies images into 1000 different classes.

This model will be run with different inference engines (TensorRT, CUDA, CPU) by making use of the ONNX Runtime Python package, all inside UbiOps.
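As a rough illustration of how one model can run on different engines, the sketch below filters a priority-ordered list of ONNX Runtime execution providers down to those that are actually available. The helper name `select_providers` is hypothetical and only serves this example. In a real session the available list would come from `onnxruntime.get_available_providers()`; here it is passed in by hand so the snippet runs without a GPU.

```python
def select_providers(preferred, available):
    """Keep the preferred providers that are available, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError("None of the preferred execution providers is available")
    return chosen


# Priority order used in this tutorial: TensorRT first, then CUDA, then CPU
PREFERRED = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

# Assumption for the example: a machine that only reports the CPU provider
print(select_providers(PREFERRED, ["CPUExecutionProvider"]))
# prints ['CPUExecutionProvider']
```

The resulting list can be passed as the `providers` argument when the inference session is created.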

The following software stack will be used inside UbiOps:

Technology | Version
Operating System | Ubuntu 22.04
Python | 3.10
CUDA | 11.7.1

Note: GPU Access is needed inside UbiOps!

1) Set project variables and initialize the UbiOps API Client

First, make sure you create an API token with project editor permissions in your UbiOps project and paste it below. Also fill in your corresponding UbiOps project name.

import yaml
import os

API_TOKEN = "<insert-your-token-here>"  # Make sure this is in the format "Token token-code"
PROJECT_NAME = "<your-project-name>"  # Fill in the corresponding UbiOps project name
DEPLOYMENT_NAME = "tensorrt-infer-tutorial"  # You can change this to any name for the deployment
DEPLOYMENT_VERSION_TENSORRT_T4 = "tensorrt-t4"  # Choose a name for the version. We use `tensorrt-t4` here, as we will use TensorRT on an NVIDIA T4 GPU

ubiops_package_name_tensorrt = "ubiops_tensorrt_inference_pkg"
if not os.path.exists(ubiops_package_name_tensorrt):
    # This will create a new local folder to use for deployment files later
    os.makedirs(ubiops_package_name_tensorrt)

print(f"Your new deployment will be called: {DEPLOYMENT_NAME}.")
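If you prefer not to paste the token into the notebook, one option is to read it from an environment variable. The sketch below is an assumption-laden example, not part of the original notebook: the variable name `UBIOPS_API_TOKEN` and the helper `load_api_token` are hypothetical. It also adds the "Token " prefix when only the token code was provided.

```python
import os


def load_api_token(env_var="UBIOPS_API_TOKEN", environ=None):
    """Return an API token in the "Token <token-code>" format.

    `UBIOPS_API_TOKEN` is a hypothetical variable name chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var, "").strip()
    if not raw:
        raise ValueError(f"Environment variable {env_var} is empty or not set")
    # Add the "Token " prefix if only the token code was provided
    return raw if raw.startswith("Token ") else f"Token {raw}"


# Example with an explicit mapping instead of the real environment
print(load_api_token(environ={"UBIOPS_API_TOKEN": "abc123"}))
# prints Token abc123
```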

Initialize the UbiOps client library with the API token

Now we import the UbiOps Python client and authenticate with the API. You can install the client with pip install ubiops.

!pip install -qU ubiops

import ubiops

configuration = ubiops.Configuration(host="")
configuration.api_key["Authorization"] = API_TOKEN

client = ubiops.ApiClient(configuration)
api = ubiops.CoreApi(client)

2) Set up the deployment environment

To get TensorRT running inside the deployment, we need to set up the deployment environment. This is done by specifying the requirements.txt and ubiops.yaml files.

a) Requirements.txt file

The requirements.txt file lists all the necessary packages that have to be installed in the deployment container. UbiOps will install these packages automatically for you.

%%writefile ubiops_tensorrt_inference_pkg/requirements.txt


b) ubiops.yaml file

The ubiops.yaml file is used to install additional software packages and set environment variables. The latter is what we're after.
We will specify where TensorRT is installed this way!

%%writefile ubiops_tensorrt_inference_pkg/ubiops.yaml

environment_variables:
  - LD_LIBRARY_PATH=/var/deployment_instance/venv/lib/python3.10/site-packages/tensorrt/:${LD_LIBRARY_PATH}
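To verify that the variable took effect, a small check like the following could be run inside the deployment (for example at the start of `__init__`). This is a minimal sketch; the helper name `library_dir_on_path` is hypothetical, and the directory is the one set in the ubiops.yaml file above.

```python
import os


def library_dir_on_path(directory, ld_library_path):
    """Check whether `directory` is one of the entries of an LD_LIBRARY_PATH string."""
    wanted = os.path.normpath(directory)
    # Entries are separated by ":"; skip empty entries such as a trailing ":"
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


TENSORRT_DIR = "/var/deployment_instance/venv/lib/python3.10/site-packages/tensorrt/"
print(library_dir_on_path(TENSORRT_DIR, os.environ.get("LD_LIBRARY_PATH", "")))
```

If this prints False inside the deployment, ONNX Runtime may not be able to find the TensorRT libraries.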

3) Deployment code

The code used will be put inside a file. This code will be run when a call to the deployment is made.

We will set up the deployment in such a way that it expects a file path as input.
The output will be an array of 5 strings with the top 5 predicted classes and their probabilities. The inference time will also be given in the output.

The model input is contained in the data variable, which is a dictionary. We will return a dictionary including the array of top 5 predicted classes and the inference time:

Deployment input & output variables

Field type | Variable name | Data type
Input | file | File
Output | predictions | Array of Strings
Output | time | Double precision
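When testing the deployment code locally, it can help to check that the returned dictionary matches these output fields before uploading. The helper below is a hypothetical sketch (the name `find_output_problems` is not part of UbiOps); it only checks the two output fields listed above.

```python
def find_output_problems(result):
    """Return a list of mismatches between `result` and the declared output fields."""
    problems = []
    predictions = result.get("predictions")
    if not isinstance(predictions, list) or not all(isinstance(p, str) for p in predictions):
        problems.append("'predictions' must be a list of strings")
    inference_time = result.get("time")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(inference_time, bool) or not isinstance(inference_time, (int, float)):
        problems.append("'time' must be a number")
    return problems


print(find_output_problems({"predictions": ["1. tabby: 0.912"], "time": 0.12}))
# prints []
```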

The following code will be used:

%%writefile ubiops_tensorrt_inference_pkg/

import os
import time
import urllib.request

import numpy as np
import onnxruntime as rt
from PIL import Image

class Deployment:
    def __init__(self, base_directory, context):
        # Check if the model exists
        if not os.path.exists('resnet152-v2-7.onnx'):
            # Download the model
            print('Downloading model...')
            print('Model downloaded')

        # Check if the labels file exists
        if not os.path.exists('synset.txt'):
            # Download the labels file
            print('Downloading labels...')
            print('Labels downloaded')

        # Load the model and set available providers - TensorRT, CUDA, CPU
        self.sess = rt.InferenceSession('resnet152-v2-7.onnx',
                                        providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider',
                                                   'CPUExecutionProvider'])

        # Define input and output names
        self.input_name = self.sess.get_inputs()[0].name
        self.output_name = self.sess.get_outputs()[0].name

        # Load the labels file
        with open('synset.txt', 'r') as f:
            self.labels = [line.strip() for line in f.readlines()]

    def request(self, data):
        # Open the image using PIL
        img = Image.open(data['file'])

        # Resize the image to the input size expected by the model
        img = img.resize((224, 224))

        # Convert the image to a numpy array
        img = np.asarray(img)

        # Convert the image to the format expected by the model (RGB, float32)
        img = img[:, :, :3]  # remove alpha channel if present
        img = img.transpose((2, 0, 1))  # change from HWC to CHW format
        img = img.astype(np.float32) / 255.0  # normalize pixel values to [0, 1]

        # Normalize using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
        img[0] = (img[0] - 0.485) / 0.229
        img[1] = (img[1] - 0.456) / 0.224
        img[2] = (img[2] - 0.406) / 0.225

        img = np.expand_dims(img, axis=0)  # add batch dimension

        # Run inference and time it
        start = time.time()
        output = self.sess.run([self.output_name], {self.input_name: img})[0].flatten()
        end = time.time()
        inference_time = end - start

        # Get the top 5 predicted classes and their probabilities
        top_idx = np.argsort(output)[::-1][:5]
        top_prob = output[top_idx]

        # Create the predictions
        predictions = [f'{i + 1}. {self.labels[top_idx[i]]}: {top_prob[i]:.3f}' for i in range(5)]

        # Return the predictions and the inference time
        return {
            'predictions': predictions,
            'time': inference_time
        }
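The code above reports the model output values directly. The source does not state whether the ONNX model already applies a softmax, so if the outputs turn out to be raw scores, they can be converted to probabilities first. The following is a minimal pure-Python sketch under that assumption; `softmax` and `top_k` are hypothetical helper names, and in the deployment the scores could be passed as `output.tolist()`.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, labels, k=5):
    """Format the k most likely labels in the same style as the deployment output."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [f"{rank + 1}. {labels[i]}: {probs[i]:.3f}" for rank, i in enumerate(order)]


print(top_k([2.0, 0.5, 1.0], ["cat", "dog", "fox"], k=2))
# prints ['1. cat: 0.629', '2. fox: 0.231']
```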

4) Define and create the deployment in UbiOps

We will now use the Python client to define a new deployment in UbiOps with the input and output fields described earlier.

deployment_template = ubiops.DeploymentCreate(
    name=DEPLOYMENT_NAME,
    input_fields=[{"name": "file", "data_type": "file"}],
    output_fields=[
        {"name": "predictions", "data_type": "array_string"},
        {"name": "time", "data_type": "double"},
    ],
)

api.deployments_create(project_name=PROJECT_NAME, data=deployment_template)

Create a deployment version

Next we create a version for the deployment. For the version we set the name, environment and size of the instance (we're using a GPU instance here, check if the instance type specified here is available!).

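template_tensorrt_t4 = {
    "version": DEPLOYMENT_VERSION_TENSORRT_T4,
    "environment": "ubuntu22-04-python3-10-cuda11-7-1",
    "instance_type": "16384mb_t4",
    "maximum_instances": 1,
    "minimum_instances": 0,
    "maximum_idle_time": 300,
}

api.deployment_versions_create(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=template_tensorrt_t4
)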

5) Package and upload the code

Now that we have the deployment and version defined, we can upload the actual code to it. We zip and upload the folder containing the requirements.txt and ubiops.yaml files and the deployment code. As we do this, UbiOps will build a container based on the settings above and install all packages defined in our requirements file.

This process can take a few minutes!

Tip: You can also check the status in the UbiOps browser UI by navigating to the deployment version and clicking the `logs` icon.

import shutil

zip_dir_tensorrt = shutil.make_archive(
    "deployment_package_tensorrt", "zip", ubiops_package_name_tensorrt
)
# Upload the directory with model files to UbiOps
upload_response = api.revisions_file_upload(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION_TENSORRT_T4,
    file=zip_dir_tensorrt,
)

Wait for the deployment to be ready
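The waiting step can be sketched as a generic polling loop. The helper below is a hypothetical example, not the notebook's original cell: `wait_until` repeatedly calls a user-supplied `check` function until it returns a truthy value or a timeout expires. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def wait_until(check, timeout=600.0, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call `check()` until it returns a truthy value or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        sleep(interval)


# Example with a stand-in check that succeeds on the third attempt
attempts = []


def build_finished():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(build_finished, timeout=10, interval=0)
print(f"Ready after {len(attempts)} checks")
# prints Ready after 3 checks
```

In this notebook, `check` would wrap a call that fetches the build status of the deployment version through the UbiOps client. The UbiOps client library also appears to provide its own helper for this, `ubiops.utils.wait_for_deployment_version`; check the client documentation for its exact arguments.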


6) Create a request to the model API in UbiOps to run inference

Now it's time to make a request! Since we use a file as input, we first need to upload our file to UbiOps. Let's download a random sample image from the imagenet-sample-images GitHub repository.

import requests
import random

# Set url for Github API endpoint
url = ""

# Send request
data = requests.get(url).json()

# Get random image download url from the data
download_url = data[random.randint(2, len(data) - 1)]["download_url"]  # randint is inclusive, so stop at the last valid index

# download the image
response = requests.get(download_url)

# save the image to disk
with open("image.jpg", "wb") as f:
    f.write(response.content)

Now we can see if the image has downloaded correctly and show it!

from IPython.display import Image

Image(filename="image.jpg")


Let's upload this image to UbiOps and create a direct request!

ubiops_file_url = ubiops.utils.upload_file(
    client=client, project_name=PROJECT_NAME, file_path="image.jpg"
)
print(f"Image file: {ubiops_file_url}")

data = {"file": ubiops_file_url}

request_response = api.deployment_version_requests_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION_TENSORRT_T4,
    data=data,
)
print("Request finished!")

Let's print the output in an easy-to-read format!

for pred in request_response.result["predictions"]:
    print("- " + pred)
print("Time: " + str(request_response.result["time"]))

Try running a request multiple times to see the effect of the TensorRT start-up time and the inference time when the model has been built!
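One way to make this comparison concrete is to time several calls and separate the first (cold) call from the later (warm) ones. The helper below is a hypothetical sketch that works with any callable; `time_calls` is not part of the UbiOps client. It measures wall-clock time on the caller's side, which includes network overhead, whereas the `time` field in the deployment output covers only the model inference.

```python
import statistics
import time


def time_calls(fn, runs=5, clock=time.perf_counter):
    """Call `fn` `runs` times and split the first (cold) call from the rest."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = clock()
        fn()
        durations.append(clock() - start)
    return {
        "first_call": durations[0],
        "mean_warm": statistics.mean(durations[1:]) if runs > 1 else None,
    }


# Example with a stand-in workload; replace the lambda with a function that sends a request
stats = time_calls(lambda: sum(range(100_000)), runs=3)
print(stats)
```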

The preceding code has been benchmarked with different inference engines and hardware setups (different GPUs).
The start-up times and inference times can be seen in the following figures:



7) Cleanup

Close the connection with the UbiOps API client.

client.close()