
Run NVIDIA Triton Inference Server

This how-to outlines how to run an NVIDIA Triton Inference Server inside UbiOps. The Triton Inference Server will be deployed with PyTriton, a Flask/FastAPI-like Python interface to the Triton Inference Server. This allows you to use a single deployment to host multiple models. We will walk through the following steps:

  1. Set up a UbiOps environment
  2. Create a Triton Inference Server deployment and bind model(s) to the Triton server
  3. Create the UbiOps deployment request method

Environment Setup

We need the nvidia-pytriton package to set up a Triton server. Therefore, we build our environment by adding a requirements.txt that contains at least the following:

nvidia-pytriton

More packages can be added to the requirements.txt file if needed.
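
For example, the request method later in this how-to uses the requests library, and the example inference function sketched further on uses NumPy. A slightly extended requirements.txt could therefore look like this (the extra packages are assumptions; add only what your own code needs):

nvidia-pytriton
numpy
requests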

Create a Triton Inference Server deployment and bind model(s) to the Triton server

Setting up a (basic) Triton server consists of the following steps:

  1. Create a Triton object
  2. Bind a model to the Triton object
  3. Run/serve the Triton object

The implementation of these steps for a UbiOps deployment is shown in the following code block:

from pytriton.triton import Triton
from pytriton.model_config import ModelConfig


class Deployment:
    def __init__(self):
        # Step 1: Call the Triton constructor
        self.triton = Triton()

        # Step 2: Bind a model to the Triton object
        self.triton.bind(
            model_name="Your Model name", # TODO: Add your model name
            infer_func=self.your_infer_function, # TODO: Add your infer function
            inputs=[
                # TODO: Add your input tensors
            ],
            outputs=[
                # TODO: Add your output tensors
            ],
            config=ModelConfig() # TODO: Add your model config
        )

        # Step 3: Run the Triton object
        self.triton.run()
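
To make the skeleton above concrete, a minimal, filled-in version could look like the sketch below. It binds a hypothetical model called add_one that simply adds 1.0 to its input tensor; the model name, the tensor names, dtypes and shapes, and the max_batch_size are assumptions that you should replace with values matching your own model. The @batch decorator from PyTriton collects incoming requests into batched NumPy arrays before they reach the inference function.

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


class Deployment:
    def __init__(self):
        # Step 1: Call the Triton constructor
        self.triton = Triton()

        # Step 2: Bind a (hypothetical) model to the Triton object
        self.triton.bind(
            model_name="add_one",  # assumed model name
            infer_func=self._infer_add_one,
            inputs=[
                Tensor(name="input", dtype=np.float32, shape=(-1,)),
            ],
            outputs=[
                Tensor(name="output", dtype=np.float32, shape=(-1,)),
            ],
            config=ModelConfig(max_batch_size=16),  # assumed batch size
        )

        # Step 3: Run the Triton object; run() starts the server in the background
        self.triton.run()

    @batch
    def _infer_add_one(self, **inputs):
        # `inputs` holds batched NumPy arrays keyed by the input tensor names;
        # return a dict keyed by the output tensor names
        return {"output": inputs["input"] + 1.0}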

Create request method

Finally, we construct the request method of the deployment. This method is called whenever a request is made to the deployment. We want to be able to select which model is used for inference and pass along the data that is sent to that model. Therefore, we add the following two inputs to the deployment:

Input        Type
json_data    String
model_name   String

The json_data input contains the data that is sent to the Triton server, and the model_name input contains the name of the model that is used for inference. The request method looks like this:

import requests

class Deployment:
    def request(self, data):
        model_name = data["model_name"]
        json_data = data["json_data"]
        print(f"Received json_data: {json_data}")
        # Forward the request to the Triton server's HTTP/REST endpoint (default port 8000)
        url = f"http://localhost:8000/v2/models/{model_name}/infer"
        headers = {"Content-Type": "application/json"}
        response = requests.post(url, headers=headers, data=json_data)

        return {"output": response.text}
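
The json_data string is expected to follow Triton's HTTP/REST inference protocol (KServe v2 style). As an illustration, a payload for the hypothetical add_one model sketched earlier could be built as follows; the tensor name, datatype, and shape are assumptions that must match whatever you bound in __init__:

import json

# Hypothetical payload for the add_one example; adjust names, shapes and
# datatypes to the tensors you bound to your own model
json_data = json.dumps({
    "inputs": [
        {
            "name": "input",
            "shape": [1, 3],
            "datatype": "FP32",
            "data": [1.0, 2.0, 3.0],
        }
    ]
})

request_data = {"model_name": "add_one", "json_data": json_data}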

Final code

The final code looks as follows:

from pytriton.triton import Triton
from pytriton.model_config import ModelConfig
import requests


class Deployment:
    def __init__(self):
        # Step 1: Call the Triton constructor
        self.triton = Triton()

        # Step 2: Bind a model to the Triton object
        self.triton.bind(
            model_name="Your Model name", # TODO: Add your model name
            infer_func=self.your_infer_function, # TODO: Add your infer function
            inputs=[
                # TODO: Add your input tensors
            ],
            outputs=[
                # TODO: Add your output tensors
            ],
            config=ModelConfig() # TODO: Add your model config
        )

        # Step 3: Run the Triton object
        self.triton.run()

    def request(self, data):
        model_name = data["model_name"]
        json_data = data["json_data"]
        print(f"Received json_data: {json_data}")
        # Forward the request to the Triton server's HTTP/REST endpoint (default port 8000)
        url = f"http://localhost:8000/v2/models/{model_name}/infer"
        headers = {"Content-Type": "application/json"}
        response = requests.post(url, headers=headers, data=json_data)

        return {"output": response.text}
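
To test the deployment, you can, for example, send a request with the UbiOps Python client. The sketch below is an assumption-heavy example: the project name, deployment name, and API token are placeholders, and the payload reuses the hypothetical add_one model from above.

import json
import ubiops

# Placeholders: fill in your own API token, project name and deployment name
configuration = ubiops.Configuration(
    host="https://api.ubiops.com/v2.1",
    api_key={"Authorization": "Token <YOUR_API_TOKEN>"},
)
api = ubiops.CoreApi(ubiops.ApiClient(configuration))

json_data = json.dumps({
    "inputs": [
        {"name": "input", "shape": [1, 3], "datatype": "FP32", "data": [1.0, 2.0, 3.0]}
    ]
})

result = api.deployment_requests_create(
    project_name="<YOUR_PROJECT>",
    deployment_name="<YOUR_DEPLOYMENT>",
    data={"model_name": "add_one", "json_data": json_data},
)
print(result.result)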

We have now created a deployment (and environment) that runs an NVIDIA Triton Inference Server inside UbiOps. This deployment can host multiple models at the same time, and by deploying to UbiOps you can easily scale and monitor it, among other benefits.
Don't hesitate to contact us if you have any questions!