Deploy an OpenAI-compatible Ollama inference server on UbiOps¶
In this tutorial, we will explain how to run LLMs supported by Ollama on UbiOps. Ollama is distributed via a custom install.sh script, which makes it possible to build a custom Docker image with Ollama by running install.sh inside a Dockerfile. We will create a custom environment based on a UbiOps base environment (so that it supports the requests format) and deploy it using the bring-your-own-image feature. Finally, we will make requests to the Ollama server using UbiOps requests and the OpenAI Python package.
1. Set up a connection with the UbiOps API client¶
First, we need to install the UbiOps Python Client Library to interface with UbiOps from Python:
!pip install -qU ubiops openai
Now, we need to initialize all the necessary variables for the UbiOps deployment and set up the deployment directory, which we will later zip and upload to UbiOps.
API_TOKEN = "<INSERT API TOKEN WITH PROJECT EDITOR RIGHTS>"
PROJECT_NAME = "<INSERT YOUR PROJECT NAME>"
DEPLOYMENT_NAME = "ollama-server"
ENVIRONMENT_NAME = "ollama-environment"
DEPLOYMENT_VERSION = "v1" # Choose a name for the version.
INSTANCE_TYPE = "16384 MB + 4 vCPU (Dedicated)"
print(f"Your new deployment will be called: {DEPLOYMENT_NAME}.")
Next, let's initialize the UbiOps client.
import ubiops
configuration = ubiops.Configuration(host="https://api.ubiops.com/v2.1")
configuration.api_key["Authorization"] = API_TOKEN
client = ubiops.ApiClient(configuration)
api = ubiops.CoreApi(client)
api.service_status()
2. Create an Ollama image¶
Before running this step, ensure that both the Docker client and engine are installed on your machine: https://docs.docker.com/engine/install/
Pull a base image with a UbiOps agent:
!docker pull <registry>/ubiops-deployment-instance-ubuntu24.04-python3.11:v5.8.0
Create a Dockerfile that installs Ollama and the OpenAI client on top of the base image. It is important to use the base image provided by UbiOps because it includes an agent implementation that will start your deployment code when a request arrives. If you do not have access to a registry that contains UbiOps base environments, please contact your account manager or reach out to our support portal.
docker_file = """
FROM <registry>/ubiops-deployment-instance-ubuntu24.04-python3.11:v5.8.0
USER root
RUN apt-get update && \
apt-get install --no-install-recommends -y git curl && \
apt-get -y autoremove && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN curl -fsSL https://ollama.com/install.sh | sh
USER deployment
RUN pip install urllib3==1.26.19 jsonschema==3.2.0 django==5.1.4
RUN pip install ollama openai
"""
with open("Dockerfile", "w") as f:
    f.write(docker_file)
Build the new image and save it as a tar archive.
!docker build . -t ollama-ubiops
!docker save -o ollama-ubiops.tar.gz ollama-ubiops
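Uploading an image archive of several gigabytes can take a while, so it can be useful to confirm the archive was written and check its size first. A minimal sketch (not part of the original workflow):
import os

archive_path = "ollama-ubiops.tar.gz"
assert os.path.exists(archive_path), "Run the docker build and save commands above first"
print(f"{archive_path}: {os.path.getsize(archive_path) / (1024 ** 3):.2f} GiB")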
3. Creating an Environment¶
We will create an environment with an Ollama server that supports the request format. Installing Ollama is not possible using environment files; we need a custom script to install it, and therefore we use the bring-your-own-image functionality. We still want to use the managed request endpoint, so that we benefit from functionalities such as autoscaling and request logging.
data = ubiops.EnvironmentCreate(
name=ENVIRONMENT_NAME,
description="Environment with an ollama server that supports requests format",
supports_request_format=True,
)
api.environments_create(PROJECT_NAME, data)
api_response = api.environment_revisions_file_upload(
PROJECT_NAME,
ENVIRONMENT_NAME,
file="./ollama-ubiops.tar.gz"
)
ubiops.utils.wait_for_environment(client, PROJECT_NAME, ENVIRONMENT_NAME)
api_response
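Once the upload has finished and wait_for_environment returns, you can inspect the environment details to confirm it was processed. A small sketch, assuming the environments_get call of the UbiOps client:
environment = api.environments_get(PROJECT_NAME, ENVIRONMENT_NAME)
print(environment)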
4. Creating a UbiOps deployment¶
In this section, we will create the UbiOps deployment.
4.1 Create UbiOps deployment¶
Now we can create the deployment, where we define the inputs and outputs of the model. Each deployment can have multiple versions. For each version, you can deploy different code, environments, instance types, etc.
The deployment will have supports_request_format enabled to allow autoscaling and monitoring of requests. We use the request endpoint to pass payloads to the OpenAI-compatible chat completions endpoint. Therefore we will use the plain input and output data types:

| Type | Data Type |
| --- | --- |
| Input | Plain |
| Output | Plain |
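Because both the input and output are plain, every request payload is simply a JSON string in the OpenAI chat completions format, which the deployment code later parses with json.loads. A hedged example of what such a payload could look like (the exact fields depend on what you want to forward to Ollama):
import json

example_payload = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is UbiOps?"}
    ],
    "stream": False
})
print(example_payload)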
deployment = api.deployments_create(
project_name=PROJECT_NAME,
data={
"name": DEPLOYMENT_NAME,
"description": "Ollama deployment",
"supports_request_format": True,
"input_type": "plain",
"output_type": "plain",
}
)
print(deployment)
4.2 Create a deployment version¶
Next, we create a version for the deployment. For the version, we set the name, the environment, and the instance type. Check that the instance type specified here is available in your project; for larger models, pick a GPU instance type instead.
version_template = {
"version": DEPLOYMENT_VERSION,
"environment": ENVIRONMENT_NAME,
"instance_type_group_name": INSTANCE_TYPE,
"maximum_instances": 1,
"minimum_instances": 0,
"instance_processes": 5,
"maximum_idle_time": 900,
}
deployment_version = api.deployment_versions_create(
project_name=PROJECT_NAME,
deployment_name=DEPLOYMENT_NAME,
data=version_template,
)
print(deployment_version)
4.3 Creating a deployment directory¶
Let's create a deployment package directory, to which we will add our deployment package files.
import os
dir_name = "deployment_package"
# Create directory for the deployment if it does not exist
os.makedirs(dir_name, exist_ok=True)
4.4 Creating Deployment Code for UbiOps¶
We will now create the deployment code that will run on UbiOps. This involves creating a deployment.py file containing a Deployment class with two key methods:

- __init__: runs when the deployment starts. It can be used to load models, data artifacts, and other requirements for inference.
- request(): executes every time a call is made to the model's REST API endpoint. It contains the logic for processing incoming data.

We configured instance_processes to 5 in the version above, allowing each deployment instance to handle 5 concurrent requests. The Ollama server is started as a background process within the __init__ of the first process. An OpenAI client is also initialized in each process to proxy requests from all running processes to the Ollama server.
These environment variables are set to optimize Ollama's behavior:

- OLLAMA_KEEP_ALIVE=-1: keeps the model loaded in memory at all times.
- OLLAMA_HOST=0.0.0.0:11434: serves Ollama on all network interfaces, so it can also be exposed through port forwarding.
For a complete overview of the deployment code structure, refer to the UbiOps documentation.
%%writefile {dir_name}/deployment.py
import subprocess
import os
import logging
import json
import time
from openai import OpenAI, BadRequestError
logging.basicConfig(level=logging.INFO)
import ollama
class PublicError(Exception):
    def __init__(self, public_error_message):
        super().__init__()
        self.public_error_message = public_error_message


class Deployment:
    def __init__(self, base_directory, context):
        print("Initializing deployment")
        self.model_name = os.environ.get("MODEL_NAME", "smollm")

        # In every process, initiate a client
        self.client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        self.envs = {"OLLAMA_KEEP_ALIVE": "-1", "OLLAMA_HOST": "0.0.0.0:11434"}

        if context["process_id"] == 0:
            print("Initializing Ollama server...")
            # Serve Ollama as a background process; os.environ comes first so the
            # Ollama-specific settings above take precedence
            subprocess.Popen(["ollama", "serve"], env={**os.environ, **self.envs})
            time.sleep(5)  # wait for Ollama to be served
            ollama.pull(self.model_name)
            self.poll_health_endpoint()

    def request(self, data, context):
        """
        Processes incoming requests using the OpenAI-compatible API.
        """
        print("Processing request")
        input_data = json.loads(data)
        stream_boolean = input_data.get("stream", False)  # Default to non-streaming
        input_data["model"] = self.model_name

        if stream_boolean:
            input_data["stream_options"] = {"include_usage": True}

        try:
            response = self.client.chat.completions.create(**input_data)
        except BadRequestError as e:
            raise PublicError(str(e))

        if stream_boolean:
            streaming_callback = context["streaming_update"]
            full_response = []
            for partial_response in response:
                chunk_dump = partial_response.model_dump()
                streaming_callback(json.dumps(chunk_dump))
                full_response.append(chunk_dump)
            return full_response

        return response.model_dump()

    def poll_health_endpoint(self):
        """
        Sends a warm-up request to the Ollama server to make sure it is initialized.
        """
        print("Waiting for Ollama server to be ready...")
        try:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {"role": "system", "content": "You warmed up yet?"}
                ],
                stream=False,
            )
            print(f"{self.model_name}'s first response: \n {response}")
        except RuntimeError as e:
            print(f"Runtime error: {e}")
            raise  # Exit on error and raise exception
import shutil
# Archive the deployment directory
deployment_zip_path = shutil.make_archive(dir_name, 'zip', dir_name)
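The deployment code above reads an optional MODEL_NAME environment variable (defaulting to smollm) to decide which model to pull. As a sketch, assuming the deployment_version_environment_variables_create call of the UbiOps client, you could point the deployment at a different Ollama model before sending requests:
api.deployment_version_environment_variables_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    data=ubiops.EnvironmentVariableCreate(
        name="MODEL_NAME",
        value="llama3.2",  # any model tag available in the Ollama library
        secret=False,
    ),
)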
4.5 Upload a revision¶
We will now upload the deployment package to UbiOps. This step will take some time because, in the background, UbiOps combines the package with the environment and builds a Docker container from it. You can check the UI for progress.
upload_response = api.revisions_file_upload(
project_name=PROJECT_NAME,
deployment_name=DEPLOYMENT_NAME,
version=DEPLOYMENT_VERSION,
file=dir_name+".zip",
)
print(upload_response)
# Check if the deployment is finished building. This can take a few minutes
ubiops.utils.wait_for_deployment_version(
client=api.api_client,
project_name=PROJECT_NAME,
deployment_name=DEPLOYMENT_NAME,
version=DEPLOYMENT_VERSION,
revision_id=upload_response.revision,
)
5. Making requests to the deployment¶
Our deployment is now live on UbiOps! Let's test it out by sending a few requests. Each request is a simple prompt asking the model to answer a question. If your deployment still needs to scale up, it may take some time before your first request is picked up. You can check the logs of your deployment version to see if the Ollama server is ready to accept requests.
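You can also check the version's status programmatically; a short sketch, assuming the deployment_versions_get call and its status field:
version_info = api.deployment_versions_get(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
)
print(version_info.status)  # 'available' means the version can serve requests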
5.1 Send a batch of requests¶
This section sends a batch of duplicate requests. It allows you to observe how Ollama fetches and processes multiple requests simultaneously.
import json
request_template = {
"messages": [
{
"content": "You are a helpful assistant.",
"role": "system"
},
{
"content": "{question}",
"role": "user"
}
],
"stream": False
}
questions = [
"What is the weather like today?",
"How do I cook pasta?",
"Can you explain quantum physics?",
"What is the capital of France?",
"How do I learn Python?"
]
import copy

requests_data = []
for question in questions:
    # Deep-copy the template so each request gets its own messages list
    filled_request = copy.deepcopy(request_template)
    filled_request['messages'][1]['content'] = question
    requests_data.append(filled_request)
# Print the resulting requests
print(json.dumps(requests_data, indent=2))
send_plain_batch = [json.dumps(item) for item in requests_data]
requests = api.batch_deployment_requests_create(
project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=send_plain_batch, timeout=3600
)
print(api.deployment_requests_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    data=json.dumps(requests_data[0])  # plain deployments expect a JSON string
))
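Batch requests are processed asynchronously, so the batch call above returns request objects rather than results. A sketch of how you could poll for the first result, assuming the deployment_requests_get call of the client:
import time

first_request = requests[0]
while True:
    result = api.deployment_requests_get(
        project_name=PROJECT_NAME,
        deployment_name=DEPLOYMENT_NAME,
        request_id=first_request.id,
    )
    if result.status in ("completed", "failed"):
        break
    time.sleep(5)

print(result.result)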
5.2 Sending a request with streaming output¶
For this request, we will add the key stream: true to the input, enabling streaming responses.
request_data = {
"messages": [
{
"content": "You are a helpful assistant.",
"role": "system"
},
{
"content": "How is the weather?",
"role": "user"
}
],
"stream": True
}
# Create a streaming deployment request
for item in ubiops.utils.stream_deployment_request(
client=api.api_client,
project_name=PROJECT_NAME,
deployment_name=DEPLOYMENT_NAME,
version=DEPLOYMENT_VERSION,
data=request_data,
timeout=3600,
full_response=False,
):
    item_dict = json.loads(item)
    if item_dict.get("choices"):
        print(item_dict["choices"][0]["delta"]["content"], end="")
5.3 Sending requests to the OpenAI Endpoint¶
We can also connect to this deployment with the UbiOps OpenAI endpoint. Let's send the same messages, but through the OpenAI endpoint!
Import the OpenAI client (installed in step 1):
from openai import OpenAI
openai_client = OpenAI(
    api_key=API_TOKEN.removeprefix("Token "),  # the token value without the "Token " prefix
    base_url=f"https://api.ubiops.com/v2.1/projects/{PROJECT_NAME}/openai-compatible/v1/"
)
stream_var = False

response = openai_client.chat.completions.create(
    model=f"ubiops-deployment/{DEPLOYMENT_NAME}/{DEPLOYMENT_VERSION}/openaiwrapper/test-model",
    messages=[{"role": "user", "content": "Can you tell me more about openai in exactly two lines"}],
    stream=stream_var
)
if stream_var:
    for chunk in response:
        if hasattr(chunk, 'choices') and chunk.choices:
            print(chunk.choices[0].delta.content, end="")  # Extract and print only the text
else:
    if hasattr(response, 'choices') and response.choices:
        print(response.choices[0].message.content)
6. Cleanup¶
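If you no longer need the resources created in this tutorial, you can optionally delete the deployment and the environment first. A minimal sketch using the standard delete calls (only run this if you want to remove them):
api.deployments_delete(PROJECT_NAME, DEPLOYMENT_NAME)
api.environments_delete(PROJECT_NAME, ENVIRONMENT_NAME)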
Finally, let's close our connection to UbiOps.
client.close()
We have set up a deployment that hosts an Ollama server. This tutorial just serves as an example. Feel free to reach out to our support portal if you want to discuss your set-up in more detail.