Run NVIDIA Triton Inference Server¶
This how-to outlines how to run an NVIDIA Triton Inference Server inside UbiOps. The Triton Inference Server will be deployed with PyTriton, a Python interface to the Triton Inference Server. This allows you to use one deployment to host multiple models. We will show how to do the following steps:

1. Set up a UbiOps environment
2. Create a Triton Inference Server deployment and bind model(s) to the Triton server
3. Create the UbiOps deployment request method
Environment Setup¶
We need the nvidia-pytriton package to set up a Triton server. Therefore, we build our environment by adding a requirements.txt file with at least the following line:

nvidia-pytriton

More packages can be added to the requirements.txt file if needed.
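For the deployment built in this how-to, a requirements.txt could for instance look like the example below. The extra packages are assumptions based on the code later in this guide: requests is used in the request method and numpy can be used to define the input and output tensors.

nvidia-pytriton
numpy
requests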
Create a Triton Inference Server deployment and bind model(s) to the Triton server¶
Setting up a (basic) Triton server consists of the following steps:
- Create a Triton object
- Bind a model to the Triton object
- Run/serve the Triton object
The implementation of these steps for a UbiOps deployment is shown in the following code block:
from pytriton.triton import Triton
from pytriton.model_config import ModelConfig


class Deployment:

    def __init__(self):
        # Step 1: Call the Triton constructor
        self.triton = Triton()

        # Step 2: Bind a model to the Triton object
        self.triton.bind(
            model_name="Your Model name",  # TODO: Add your model name
            infer_func=self.your_infer_function,  # TODO: Add your infer function
            inputs=[
                # TODO: Add your input tensors
            ],
            outputs=[
                # TODO: Add your output tensors
            ],
            config=ModelConfig()  # TODO: Add your model config
        )

        # Step 3: Run the Triton object
        self.triton.run()
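As an illustration of how the TODOs might be filled in, below is a minimal sketch that binds a hypothetical model named "Multiplier", which simply multiplies its input by two. The model name, infer function, tensor definitions and max_batch_size are all assumptions for this example; replace them with your own model.

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


# Hypothetical infer function: PyTriton passes the input tensors as numpy
# arrays and expects a dict that maps output names to numpy arrays.
@batch
def multiply_fn(**inputs):
    (values,) = inputs.values()
    return {"output": values * 2}


class Deployment:

    def __init__(self):
        self.triton = Triton()
        self.triton.bind(
            model_name="Multiplier",  # assumption: example model name
            infer_func=multiply_fn,
            inputs=[Tensor(name="input", dtype=np.float32, shape=(-1,))],
            outputs=[Tensor(name="output", dtype=np.float32, shape=(-1,))],
            config=ModelConfig(max_batch_size=8),
        )
        self.triton.run()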
Create the request method¶
Finally, we construct the request method of the deployment. This method is called whenever a request is made to the deployment. In each request we want to select which model is used for inference and pass the data that is sent to that model. Therefore, we add the following two inputs to the deployment:
| Input | Type |
|---|---|
| json_data | String |
| model_name | String |
The json_data input contains the data that is sent to the Triton server. The model_name input contains the name of the model that is used for the inference. The request method looks like this:
import requests


class Deployment:

    def request(self, data):
        model_name = data["model_name"]
        json_data = data["json_data"]

        print(f"Received json_data: {json_data}")

        url = f"http://localhost:8000/v2/models/{model_name}/infer"
        headers = {"Content-Type": "application/json"}
        response = requests.post(url, headers=headers, data=json_data)

        return {"output": response.text}
Final code¶
The final code looks as follows:
from pytriton.triton import Triton
from pytriton.model_config import ModelConfig

import requests


class Deployment:

    def __init__(self):
        # Step 1: Call the Triton constructor
        self.triton = Triton()

        # Step 2: Bind a model to the Triton object
        self.triton.bind(
            model_name="Your Model name",  # TODO: Add your model name
            infer_func=self.your_infer_function,  # TODO: Add your infer function
            inputs=[
                # TODO: Add your input tensors
            ],
            outputs=[
                # TODO: Add your output tensors
            ],
            config=ModelConfig()  # TODO: Add your model config
        )

        # Step 3: Run the Triton object
        self.triton.run()

    def request(self, data):
        model_name = data["model_name"]
        json_data = data["json_data"]

        print(f"Received json_data: {json_data}")

        url = f"http://localhost:8000/v2/models/{model_name}/infer"
        headers = {"Content-Type": "application/json"}
        response = requests.post(url, headers=headers, data=json_data)

        return {"output": response.text}
We have now created a deployment (and environment) that runs an NVIDIA Triton Inference Server inside UbiOps. This deployment can host multiple models at the same time, and by running it on UbiOps you can easily scale it, monitor it, and much more.
Don't hesitate to contact us if you have any questions!