Run LiteLLM Proxy with UbiOps Services
This how-to shows you how to run a LiteLLM proxy server on UbiOps using Services. LiteLLM provides a unified interface to call 100+ LLMs using the OpenAI format, allowing you to route requests to multiple models through a single endpoint.
What You'll Build
After following this guide, you'll have:
- A LiteLLM proxy server running on UbiOps that routes to multiple models
- A public HTTPS endpoint (https://<service-id>.services.ubiops.com)
- Automatic authentication using UbiOps tokens
- Permission-based access control: users can only call models they have access to
Deployment Package Structure
Create a deployment package with three files:
requirements.txt
litellm[proxy]
requests
config.yaml
model_list:
  # First UbiOps model
  - model_name: <model-alias-1>
    litellm_params:
      model: openai/ubiops-deployment/<deployment-name-1>//<version-name-1>
      api_base: https://api.ubiops.com/chat/openai-compatible/v1
  # Second UbiOps model
  - model_name: <model-alias-2>
    litellm_params:
      model: openai/ubiops-deployment/<deployment-name-2>//<version-name-2>
      api_base: https://api.ubiops.com/chat/openai-compatible/v1

litellm_settings:
  drop_params: true
  set_verbose: false
Authentication: If you don't set api_key in the config, LiteLLM forwards the end user's API key to the underlying models. Any user who can authenticate to the LiteLLM Service sees all configured models, but a request only succeeds if that user's UbiOps permissions allow access to the specific deployment behind the model.
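As a rough sanity check of the rule above, the sketch below inspects a parsed config and reports which aliases rely on the caller's key and which entries set their own api_key. The config is written as a plain Python dict that mirrors the structure of config.yaml; the helper name summarize_model_list, the deployment names, and the token value are hypothetical and only serve the illustration.

```python
from typing import Any


def summarize_model_list(config: dict[str, Any]) -> dict[str, list[str]]:
    """Split configured aliases by whether the entry sets its own api_key."""
    forwarded: list[str] = []  # no api_key in config: the caller's key is used
    pinned: list[str] = []     # api_key in config: callers share that key
    for entry in config.get("model_list", []):
        alias = entry["model_name"]
        params = entry.get("litellm_params", {})
        (pinned if "api_key" in params else forwarded).append(alias)
    return {"forwarded": forwarded, "pinned": pinned}


# Hypothetical config mirroring the structure of config.yaml.
example_config = {
    "model_list": [
        {
            "model_name": "my-mistral-model",
            "litellm_params": {
                "model": "openai/ubiops-deployment/mistral//v1",
                "api_base": "https://api.ubiops.com/chat/openai-compatible/v1",
            },
        },
        {
            "model_name": "my-llama-model",
            "litellm_params": {
                "model": "openai/ubiops-deployment/llama//v1",
                "api_base": "https://api.ubiops.com/chat/openai-compatible/v1",
                "api_key": "Token shared-service-token",  # hypothetical value
            },
        },
    ]
}

summary = summarize_model_list(example_config)
print(summary)
```

If permission-based access control is the goal, every alias should end up in the "forwarded" list.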
deployment.py
import logging
import subprocess
import time

import requests

logging.basicConfig(level=logging.INFO)


class Deployment():

    def __init__(self, base_directory, context):
        self.port = 8000
        self.url = f"http://0.0.0.0:{self.port}"
        config_path = f"{base_directory}/config.yaml"

        logging.info("Starting LiteLLM proxy server...")
        # Launch the LiteLLM proxy as a child process on the Service port
        self.process = subprocess.Popen(
            ["litellm", "--config", config_path, "--port", str(self.port)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True
        )
        self.wait_for_server()
        logging.info("LiteLLM proxy ready!")

    def request(self, data):
        """Health check - Service requests bypass this method"""
        try:
            response = requests.get(f"{self.url}/health", timeout=5)
            return {"status": "healthy", "code": response.status_code}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    def wait_for_server(self):
        """Wait for LiteLLM server to be ready"""
        max_retries = 60
        for _ in range(max_retries):
            # Stop early if the proxy process has already exited
            if self.process.poll() is not None:
                raise RuntimeError(f"LiteLLM process exited: {self.process.poll()}")
            try:
                response = requests.get(f"{self.url}/health", timeout=5)
                if response.status_code == 200:
                    return
            except requests.exceptions.RequestException:
                pass  # Server is not accepting connections yet
            # Sleep after every failed attempt, including non-200 responses
            time.sleep(5)
        raise RuntimeError("LiteLLM server failed to start")
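One thing to watch in deployment.py: the proxy's output goes to a pipe that nothing reads, and a child process can block once the operating system's pipe buffer is full. A minimal sketch of one way to handle this is to drain the pipe in a background thread and forward each line to logging. The helper name drain_output and the stand-in child process are assumptions for illustration; in the deployment, the process would be the litellm Popen object.

```python
import logging
import subprocess
import sys
import threading

logging.basicConfig(level=logging.INFO)


def drain_output(process: subprocess.Popen, sink: list[str]) -> threading.Thread:
    """Forward each line of the child's stdout to logging in a daemon thread."""

    def _pump() -> None:
        # Iteration ends when the child closes its stdout (usually on exit).
        for line in process.stdout:
            text = line.rstrip()
            sink.append(text)
            logging.info("[child] %s", text)

    thread = threading.Thread(target=_pump, daemon=True)
    thread.start()
    return thread


# Stand-in child process that prints two lines and exits.
child = subprocess.Popen(
    [sys.executable, "-c", "print('line one'); print('line two')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
captured: list[str] = []
reader = drain_output(child, captured)
child.wait()
reader.join(timeout=10)
print(captured)
```

In the Deployment class, the equivalent call would go right after subprocess.Popen and before wait_for_server, so that startup errors from the proxy also show up in the logs.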
Deploy to UbiOps
- Create a deployment with input_type: plain and output_type: plain
- Create a version with Python 3.12 environment
- Upload your deployment package (zip the three files)
- Wait for the build to complete
- Create a Service pointing to the deployment:
  - Set port: 8000 (where LiteLLM listens)
  - Enable authentication_required: true
The Service will provide a public URL: https://<service-id>.services.ubiops.com
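The "zip the three files" step can be scripted. The sketch below builds the archive with Python's standard zipfile module and fails early if a file is missing. The function name build_package and the archive name are hypothetical, and placing the files at the archive root is an assumption; check the UbiOps packaging documentation for the layout it expects.

```python
import tempfile
import zipfile
from pathlib import Path

PACKAGE_FILES = ("requirements.txt", "config.yaml", "deployment.py")


def build_package(source_dir: Path, archive_path: Path) -> Path:
    """Zip the three package files; raise if any of them is missing."""
    missing = [name for name in PACKAGE_FILES if not (source_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing package files: {missing}")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in PACKAGE_FILES:
            # Assumption: files are stored at the archive root.
            archive.write(source_dir / name, arcname=name)
    return archive_path


# Demo with placeholder file contents in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    for name in PACKAGE_FILES:
        (src / name).write_text("# placeholder\n")
    archive = build_package(src, Path(tmp) / "litellm_package.zip")
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(names)
```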
Example Requests
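Once deployed, you can call any configured model through the same endpoint using the OpenAI Python client. The model argument must match a model_name alias from config.yaml; the examples here assume the aliases my-mistral-model and my-llama-model: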
from openai import OpenAI

# Initialize client pointing to your Service
client = OpenAI(
    base_url="https://<service-id>.services.ubiops.com/v1",
    api_key="not-needed",
    default_headers={"Authorization": "<YOUR_UBIOPS_TOKEN>"}
)

# Call your first model
response = client.chat.completions.create(
    model="my-mistral-model",
    messages=[{"role": "user", "content": "Explain quantum computing in one sentence"}]
)
print(response.choices[0].message.content)
# Output: "Quantum computing uses quantum mechanics principles like superposition
# and entanglement to perform calculations exponentially faster than classical
# computers for certain problems."

# Call your second model through the same proxy
response = client.chat.completions.create(
    model="my-llama-model",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)
# Output: "The capital of France is Paris."
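For tools that cannot use the OpenAI client, the same call can be made as a plain HTTP request. The sketch below only builds the request with the standard library and does not send it. It assumes the proxy serves the OpenAI-style /v1/chat/completions route under the Service URL and that the UbiOps token goes in the Authorization header exactly as in the client example; build_chat_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_chat_request(
    service_url: str, token: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the proxy."""
    payload = {
        "model": model,  # must match a model_name alias in config.yaml
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "https://<service-id>.services.ubiops.com",
    "<YOUR_UBIOPS_TOKEN>",
    "my-mistral-model",
    "What is the capital of France?",
)
print(req.get_method(), req.full_url)

# To send it, fill in the placeholders and call:
# urllib.request.urlopen(req, timeout=30)
```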
You can also use streaming responses:
stream = client.chat.completions.create(
    model="my-mistral-model",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Output: "1, 2, 3, 4, 5"
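If you need the full text rather than incremental printing, the chunks can be joined into one string. The sketch below is a small helper that also skips chunks with an empty choices list or no content, which some OpenAI-compatible backends may emit (an assumption worth checking against your models). The helper collect_stream and the fake chunks built with SimpleNamespace are hypothetical stand-ins so the example runs without a live endpoint.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # e.g. a trailing chunk without choices
            continue
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty deltas
            parts.append(content)
    return "".join(parts)


def _fake_chunk(content):
    """Build an object shaped like a streamed chunk (for demonstration only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk("1, 2, "),
    SimpleNamespace(choices=[]),
    _fake_chunk(None),
    _fake_chunk("3"),
]
text = collect_stream(fake_stream)
print(text)  # prints: 1, 2, 3
```

With a real client, passing the stream object returned by client.chat.completions.create(..., stream=True) in place of fake_stream should work the same way.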
Benefits
- Single endpoint for multiple models
- Permission-based access - users can only call models they have UbiOps access to
- OpenAI compatibility - use standard OpenAI clients and tools
OpenTelemetry Metrics (Optional)
Add observability by including the OpenTelemetry SDK in requirements.txt:
litellm[proxy]
requests
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp
Then configure the collector in config.yaml:
litellm_settings:
  success_callback: ["otel"]
  otel_config:
    exporter: otlp_http
    endpoint: "http://your-collector:4318"
This pushes metrics to your observability platform for monitoring latency and usage patterns across all models.