Run LiteLLM Proxy with UbiOps Services

This how-to shows you how to run a LiteLLM proxy server on UbiOps using Services. LiteLLM provides a unified interface to call 100+ LLMs using the OpenAI format, allowing you to route requests to multiple models through a single endpoint.

What You'll Build

After following this guide, you'll have:

  • A LiteLLM proxy server running on UbiOps that routes to multiple models
  • A public HTTPS endpoint (https://<service-id>.services.ubiops.com)
  • Automatic authentication using UbiOps tokens
  • Permission-based access control: users can only call models they have access to

Deployment Package Structure

Create a deployment package with three files:

requirements.txt

litellm[proxy]
requests

config.yaml

model_list:
  # First UbiOps model
  - model_name: <model-alias-1>
    litellm_params:
      model: openai/ubiops-deployment/<deployment-name-1>//<version-name-1>
      api_base: https://api.ubiops.com/chat/openai-compatible/v1

  # Second UbiOps model
  - model_name: <model-alias-2>
    litellm_params:
      model: openai/ubiops-deployment/<deployment-name-2>//<version-name-2>
      api_base: https://api.ubiops.com/chat/openai-compatible/v1

litellm_settings:
  drop_params: true
  set_verbose: false

Authentication: When you don't set api_key in the config, LiteLLM forwards the end user's API key to the underlying models. Anyone who can authenticate to the LiteLLM Service can see every configured model, but a request succeeds only if the caller's UbiOps permissions grant access to the deployment behind that model.
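When the proxy fronts more than a couple of deployments, the repeated model_list entries can be generated instead of typed by hand. The sketch below is one way to do that, assuming the entry layout shown in config.yaml above. The function name render_config and the deployment and version names are hypothetical; it uses plain string formatting, so it needs nothing beyond the standard library.

```python
API_BASE = "https://api.ubiops.com/chat/openai-compatible/v1"


def render_config(models):
    """Render a config.yaml body from (alias, deployment, version) tuples.

    Mirrors the layout of the hand-written config and deliberately omits
    api_key, so the caller's own token is forwarded to UbiOps.
    """
    lines = ["model_list:"]
    for alias, deployment, version in models:
        lines += [
            f"  - model_name: {alias}",
            "    litellm_params:",
            f"      model: openai/ubiops-deployment/{deployment}//{version}",
            f"      api_base: {API_BASE}",
            "",
        ]
    lines += ["litellm_settings:", "  drop_params: true", "  set_verbose: false"]
    return "\n".join(lines) + "\n"


# Hypothetical aliases, deployment names, and versions, for illustration only.
config_text = render_config([
    ("my-mistral-model", "mistral-deployment", "v1"),
    ("my-llama-model", "llama-deployment", "v1"),
])
print(config_text)
```

Writing config_text to config.yaml before zipping the package keeps the aliases in one place, which helps when the same names are reused in client code.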

deployment.py

import subprocess
import time
import requests
import logging

logging.basicConfig(level=logging.INFO)

class Deployment():
    def __init__(self, base_directory, context):
        self.port = 8000
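        # Probe the proxy over loopback; 0.0.0.0 is a bind address, not a
        # portable destination address for client requests.
        self.url = f"http://127.0.0.1:{self.port}"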
        config_path = f"{base_directory}/config.yaml"

        logging.info("Starting LiteLLM proxy server...")
        # Let the proxy inherit this process's stdout/stderr. A PIPE that is
        # never read can fill up and block the proxy once it logs enough.
        self.process = subprocess.Popen(
            ["litellm", "--config", config_path, "--port", str(self.port)]
        )

        self.wait_for_server()
        logging.info("LiteLLM proxy ready!")

    def request(self, data):
        """Health check - Service requests bypass this method"""
        try:
            response = requests.get(f"{self.url}/health", timeout=5)
            return {"status": "healthy", "code": response.status_code}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    def wait_for_server(self):
        """Wait for LiteLLM server to be ready"""
        max_retries = 60

        for _ in range(max_retries):
            if self.process.poll() is not None:
                raise RuntimeError(f"LiteLLM process exited: {self.process.poll()}")

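            try:
                response = requests.get(f"{self.url}/health", timeout=5)
                if response.status_code == 200:
                    return
            except requests.exceptions.RequestException:
                pass  # server is not accepting connections yet

            # Back off after every failed attempt, including non-200 responses
            time.sleep(5)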

        raise RuntimeError("LiteLLM server failed to start")
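The readiness loop is easier to test when the probe is separated from the retry logic. The following is a minimal sketch, assuming a hypothetical helper named wait_until_ready that accepts any zero-argument probe callable. It is not part of LiteLLM or UbiOps, and the in-memory probe stands in for the HTTP call to /health.

```python
import time


def wait_until_ready(probe, max_retries=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True, sleeping after each failed attempt.

    Returns the number of attempts used and raises RuntimeError if the probe
    never succeeds. An exception raised by the probe counts as a failure.
    """
    for attempt in range(1, max_retries + 1):
        try:
            if probe():
                return attempt
        except Exception:
            pass  # treat any probe error as "not ready yet"
        sleep(delay)
    raise RuntimeError("server failed to start")


# Stand-in probe: reports "not ready" twice, then "ready".
answers = iter([False, False, True])
attempts = wait_until_ready(lambda: next(answers), max_retries=5, delay=0)
print(attempts)  # 3
```

Inside the deployment, the probe could be a small function that requests the health URL and returns whether the status code is 200.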

Deploy to UbiOps

  1. Create a deployment with input_type: plain and output_type: plain
  2. Create a version with a Python 3.12 environment
  3. Upload your deployment package (zip the three files)
  4. Wait for the build to complete
  5. Create a Service pointing to the deployment, with these settings:
     • Set port: 8000 (where LiteLLM listens)
     • Enable authentication_required: true

The Service will provide a public URL: https://<service-id>.services.ubiops.com
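The packaging step can be scripted so that a missing file is caught before upload. This is a sketch using only the standard library; the helper name build_package and the archive name are hypothetical. It places the three files at the archive root, which is an assumption about the layout UbiOps expects, so adjust arcname if your workspace requires a top-level folder.

```python
import tempfile
import zipfile
from pathlib import Path

PACKAGE_FILES = ["requirements.txt", "config.yaml", "deployment.py"]


def build_package(source_dir, archive_path):
    """Zip the three package files at the archive root and return the path."""
    source_dir = Path(source_dir)
    missing = [name for name in PACKAGE_FILES if not (source_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing package files: {missing}")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in PACKAGE_FILES:
            archive.write(source_dir / name, arcname=name)
    return Path(archive_path)


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in PACKAGE_FILES:
        (Path(tmp) / name).write_text("# placeholder\n")
    package = build_package(tmp, Path(tmp) / "litellm_package.zip")
    with zipfile.ZipFile(package) as archive:
        packaged_names = sorted(archive.namelist())

print(packaged_names)
```

In real use, point source_dir at the folder that holds your actual requirements.txt, config.yaml, and deployment.py.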

Example Requests

Once deployed, you can call any configured model through the same endpoint using the OpenAI Python client:

from openai import OpenAI

# Initialize client pointing to your Service
client = OpenAI(
    base_url="https://<service-id>.services.ubiops.com/v1",
    api_key="not-needed",
    default_headers={"Authorization": "<YOUR_UBIOPS_TOKEN>"}
)

# Call your first model; the model value must match a model_name alias in config.yaml
response = client.chat.completions.create(
    model="my-mistral-model",
    messages=[{"role": "user", "content": "Explain quantum computing in one sentence"}]
)
print(response.choices[0].message.content)
# Output: "Quantum computing uses quantum mechanics principles like superposition 
# and entanglement to perform calculations exponentially faster than classical 
# computers for certain problems."

# Call your second model through the same proxy
response = client.chat.completions.create(
    model="my-llama-model",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)
# Output: "The capital of France is Paris."

You can also use streaming responses:

stream = client.chat.completions.create(
    model="my-mistral-model",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Output: "1, 2, 3, 4, 5"
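If the OpenAI client is not available, the same call can be made as a plain HTTP request. The sketch below builds, but does not send, a POST to the chat completions route that the client above targets. The helper name build_chat_request is hypothetical, and it passes the token in the Authorization header exactly as the client example does.

```python
import json
import urllib.request


def build_chat_request(service_url, token, model, prompt):
    """Build (but do not send) a POST to the proxy's chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )


chat_request = build_chat_request(
    "https://<service-id>.services.ubiops.com",
    "<YOUR_UBIOPS_TOKEN>",
    "my-mistral-model",
    "Count to 5",
)
print(chat_request.get_method(), chat_request.full_url)
# To send it, pass chat_request to urllib.request.urlopen() and decode the
# JSON body of the response.
```

Keeping request construction separate from sending makes it easy to inspect the headers and payload before any traffic reaches the Service.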

Benefits

  • Single endpoint for multiple models
  • Permission-based access - users can only call models they have UbiOps access to
  • OpenAI compatibility - use standard OpenAI clients and tools

OpenTelemetry Metrics (Optional)

Add observability by including the OpenTelemetry SDK in requirements.txt:

litellm[proxy]
requests
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp

Then configure the collector in config.yaml:

litellm_settings:
  success_callback: ["otel"]

otel_config:
  exporter: otlp_http
  endpoint: "http://your-collector:4318"

This pushes metrics to your observability platform for monitoring latency and usage patterns across all models.