Deploying Llama 3.1 8B Instruct with streaming to UbiOps

This notebook shows how to create a cloud-based inference API endpoint for the Llama-3.1-8B-Instruct model using UbiOps. The model is already pre-trained and will be loaded from the unsloth repository on Hugging Face. We will use Unsloth's 4-bit quantized version, so that the model can run on an NVIDIA Ada Lovelace L4 GPU.

The Meta Llama 3.1 models are a collection of pre-trained and instruction-tuned generative text models in 8B, 70B and 405B parameter sizes. The instruction-tuned versions of these models are optimized for dialogue use cases. Meta claims that Llama 3.1 8B outperforms models of a similar size, such as Mistral 7B and Gemma 7B, on common industry benchmarks. The model deployed in this tutorial is the instruction-tuned version of the Llama 3.1 8B model. We also speed up inference using the Flash Attention library.

In this notebook, we will walk you through:

  1. Connecting with the UbiOps API client
  2. Creating a code environment for our deployment
  3. Creating a deployment for the Llama-3.1-8B-Instruct
  4. Calling the Llama 3.1 deployment endpoint and streaming the response

Llama 3.1 is a text-to-text model. Therefore we will make a deployment that takes a text prompt as input, and returns a response.

The deployment returns the tokens generated by the model in the output field. Using streaming callbacks, we can stream these tokens as they are generated. We also return input, which contains the user's prompt together with the system_prompt, and used_config, which specifies the token generation parameters.

Default pre-set values will be used for the system_prompt and the generation config. You can see how to change these by checking the __init__ of the deployment, which reads them from environment variables.

Deployment input & output variables:

                    Variable name    Data type
    Input fields    prompt           string
    Output fields   output           string
                    input            string
                    used_config      dictionary
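
To make the interface concrete, a request to this deployment and its response will roughly take the shape sketched below. The values are placeholders, not real model output:

# Example request payload: matches the single "prompt" input field
request_data = {"prompt": "Can you tell me a joke?"}

# Shape of the result the deployment returns (placeholder values)
example_result = {
    "output": "<generated text>",
    "input": "<the system prompt and user prompt passed to the model>",
    "used_config": {"do_sample": True, "max_new_tokens": 256, "eos_token_id": 128009, "temperature": 0.7},
}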

Note that we deploy to a GPU instance by default, which is not available in every project. You can contact us if you would like to enable GPU instances for your project.

Let's start coding!

1. Set up a connection with the UbiOps API client

To use the UbiOps API from our notebook, we need to install the UbiOps Python Client Library first:

%pip install -qU ubiops

Now we can set up a connection with your UbiOps environment. To do this we will need the name of your UbiOps project and an API token with project editor permissions.

You can paste your project name and API token in the code block below before running.

import ubiops
from datetime import datetime

API_TOKEN = "<INSERT API_TOKEN WITH PROJECT EDITOR RIGHTS>"  # Make sure this is in the format "Token token-code"
PROJECT_NAME = "<INSERT PROJECT NAME IN YOUR ACCOUNT>"

HF_TOKEN = "<ENTER YOUR HF TOKEN WITH ACCESS TO LLAMA REPO HERE>"  # We need this token to download the model from Huggingface

DEPLOYMENT_NAME = f"llama-3-1-8b-{datetime.now().date()}"
DEPLOYMENT_VERSION = "v1"

# Initialize client library
configuration = ubiops.Configuration(host="https://api.ubiops.com/v2.1")
configuration.api_key["Authorization"] = API_TOKEN

# Establish a connection
client = ubiops.ApiClient(configuration)
api = ubiops.CoreApi(client)
print(api.projects_get(PROJECT_NAME))
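
If you prefer not to paste secrets directly into the notebook, a minimal alternative is to read them from local environment variables before constructing the configuration above. The variable name UBIOPS_API_TOKEN below is an assumption; use whichever names you exported:

import os

# Read credentials from local environment variables instead of hard-coding them.
# The variable names are assumptions; export the ones you prefer.
API_TOKEN = os.environ["UBIOPS_API_TOKEN"]  # still in the "Token token-code" format
HF_TOKEN = os.environ["HF_TOKEN"]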

2. Setting up the environment

The environment that our model runs in can be managed separately. To do this, we select a base environment and add our additional dependencies on top of it.

environment_dir = "environment_package"
ENVIRONMENT_NAME = "llama-3-1-environment"
%mkdir {environment_dir}

We will define the Python packages required to run the model in a requirements.txt, which we will later upload to UbiOps.

%%writefile {environment_dir}/requirements.txt
# This file contains package requirements for the environment
# Installed via PIP.
torch==2.3.0+cu121
huggingface-hub==0.24.1
transformers==4.43.2
accelerate==0.33.0
bitsandbytes==0.43.2
scipy==1.14.0
diffusers==0.29.2
safetensors==0.4.3
ninja==1.11.1.1
ipywidgets==8.1.3
sentencepiece==0.2.0
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.2/flash_attn-2.6.2+cu123torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl 

Now we will create a ubiops.yaml that sets an extra pip index URL. This ensures that we install a CUDA-compatible build of PyTorch, so that the model can be loaded onto and run on the GPU.

%%writefile {environment_dir}/ubiops.yaml
environment_variables:
- PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu121
apt:
  packages:
  - git

Now we create a custom environment on UbiOps. We select Ubuntu 22.04 + Python 3.10 + CUDA 12.3.2 as the base_environment and add the dependencies we defined above on top of it. The resulting custom environment will be called llama-3-1-environment.

api_response = api.environments_create(
    project_name=PROJECT_NAME,
    data=ubiops.EnvironmentCreate(
        name=ENVIRONMENT_NAME,
        base_environment="ubuntu22-04-python3-10-cuda12-3-2",
        description="Environment to run Llama-3.1 from Huggingface",
    ),
)

Package and upload the environment files.

import shutil

environment_archive = shutil.make_archive(environment_dir, "zip", ".", environment_dir)
api.environment_revisions_file_upload(
    project_name=PROJECT_NAME,
    environment_name=ENVIRONMENT_NAME,
    file=environment_archive,
)

3. Creating a deployment for the Llama 3.1 8B Instruct model

With the environment set up, we can start writing the code to run the Llama-3.1-8B model, and push it to UbiOps.

We will create a deployment.py with a Deployment class, which has two methods:

  • The __init__ method, which runs when the deployment starts up. It can be used to load models, data artefacts and other requirements for inference.
  • The request() method, which runs every time a call is made to the model's REST API endpoint and contains all the logic for processing data.

Separating the logic over these two methods keeps response times low: the model is loaded in the __init__ method, while only the code that has to run per call lives in the request() method. This way the model only needs to be loaded when the deployment starts up.

As mentioned in the introduction, a default system_prompt and generation config will be used, both of which can be overridden with environment variables.

deployment_code_dir = "deployment_code"
!mkdir {deployment_code_dir}

%%writefile {deployment_code_dir}/deployment.py
import os
import torch
from transformers import (
    LlamaForCausalLM,
    AutoTokenizer,
    TextIteratorStreamer
)
from huggingface_hub import login
from threading import Thread


class Deployment:
    def __init__(self, base_directory, context):
        """
        Initialisation method for the deployment. Any code inside this method will execute when the deployment starts up.
        It can, for example, be used for loading modules that have to be stored in memory or setting up connections.
        """

        print("Initialising deployment")

        # Read model-related environment variables
        LLAMA_VERSION = os.environ.get('LLAMA_VERSION', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit')
        self.MAX_RESPONSE_LENGTH = int(os.environ.get('MAX_RESPONSE_LENGTH', 256))  # max_new_tokens must be an integer

        # Log in to Hugging Face
        HF_TOKEN = os.environ["HF_TOKEN"]
        login(token=HF_TOKEN)


        print("Downloading tokenizer")
        self.tokenizer = AutoTokenizer.from_pretrained(LLAMA_VERSION)
        print("Downloaded tokenizer")

        self.model = LlamaForCausalLM.from_pretrained(
            LLAMA_VERSION,
            torch_dtype=torch.bfloat16,
            device_map='auto',
            use_safetensors=True,
            attn_implementation="flash_attention_2",
        )

        self.default_config = {
            'do_sample': True,
            'max_new_tokens': self.MAX_RESPONSE_LENGTH,
            'eos_token_id': 128009,
            'temperature': 0.7
        }

        self.system_prompt = os.environ.get("SYSTEM_PROMPT", "You are a helpful assistant. Please respond to the user's query.")

    def request(self, data, context):
        """
        Method for deployment requests, called separately for each individual request.
        """
        print("Processing request")

        # Prepare the prompt with the system message and user input
        user_prompt = data["prompt"]
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_prompt},
        ]

        # Prepare the prompt and inputs
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to(torch.device("cuda"))

        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True)
        streaming_callback = context["streaming_update"]

        generation_kwargs = dict(inputs, streamer=streamer, **self.default_config)

        # Start the generation process in a separate thread
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        response = ""
        callback_counter = 0  # Counter to track chunks

        for new_text in streamer:
            new_text = new_text.replace("<|eot_id|>", "").encode("unicode_escape").decode("utf-8")
            streaming_callback(new_text)
            response += new_text

        # Return the final response as JSON output
        return {"output": response, "input": messages, "used_config": self.default_config}

Create a UbiOps deployment

Now we can create the deployment, where we define the inputs and outputs of the model. Each deployment can have multiple versions, and each version can use different deployment code, a different environment, and a different instance type, among other settings.

# Create the deployment
deployment_template = ubiops.DeploymentCreate(
    name=DEPLOYMENT_NAME,
    input_type="structured",
    output_type="structured",
    input_fields=[
        {"name": "prompt", "data_type": "string"}
    ],
    output_fields=[
        {"name": "output", "data_type": "string"},
        {"name": "input", "data_type": "string"},
        {"name": "used_config", "data_type": "dict"},
    ],
)

api.deployments_create(project_name=PROJECT_NAME, data=deployment_template)

Create a deployment version

Now we will create a version of the deployment. For the version we need to define the name, the environment, and the instance type, which determines whether we run on CPU or GPU and how much memory the instance gets.

For this model it is recommended to use a GPU instance.

# Create the version
version_template = ubiops.DeploymentVersionCreate(
    version=DEPLOYMENT_VERSION,
    environment=ENVIRONMENT_NAME,
    instance_type_group_name="16384 MB + 4 vCPU + NVIDIA Ada Lovelace L4",
    maximum_instances=1,
    minimum_instances=0,
    maximum_idle_time=600,  # = 10 minutes
    request_retention_mode="full",
)

api.deployment_versions_create(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=version_template
)

Package and upload the code:

import shutil

deployment_code_archive = shutil.make_archive(
    deployment_code_dir, "zip", deployment_code_dir
)

upload_response = api.revisions_file_upload(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    file=deployment_code_archive,
)
print(upload_response)

# Check if the deployment is finished building. This can take a few minutes
ubiops.utils.wait_for_deployment_version(
    client=api.api_client,
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    revision_id=upload_response.revision,
    stream_logs=True,
)

Before we can send requests to our deployment version, the environment has to finish building. Note that building the environment might take a while, as UbiOps needs to download and install all the packages and dependencies. The environment only needs to be built once; the next time an instance is spun up for our deployment, the dependencies do not have to be installed again. You can set stream_logs to False if you do not want to stream the logs of the build process.
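
If you also want to wait explicitly for the environment build itself, the ubiops.utils module provides a wait_for_environment helper that is used in other UbiOps tutorials; a minimal sketch, assuming it is available in your client version:

# Wait until the custom environment has finished building
ubiops.utils.wait_for_environment(
    client=api.api_client,
    project_name=PROJECT_NAME,
    environment_name=ENVIRONMENT_NAME,
    stream_logs=True,
)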

Create an environment variable

Here we create an environment variable for our Hugging Face token, which the deployment needs to download the model.

api_response = api.deployment_version_environment_variables_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    data=ubiops.EnvironmentVariableCreate(name="HF_TOKEN", value=HF_TOKEN, secret=True),
)
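
Optionally, you can override the defaults that the deployment reads in its __init__ (SYSTEM_PROMPT, MAX_RESPONSE_LENGTH and LLAMA_VERSION) by creating additional environment variables in the same way. A sketch with example values:

# Optional: override the default system prompt and response length.
# The variable names match those read in deployment.py; the values are examples.
api.deployment_version_environment_variables_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    data=ubiops.EnvironmentVariableCreate(
        name="SYSTEM_PROMPT",
        value="You are a concise assistant that answers in at most three sentences.",
        secret=False,
    ),
)

api.deployment_version_environment_variables_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    data=ubiops.EnvironmentVariableCreate(
        name="MAX_RESPONSE_LENGTH", value="512", secret=False
    ),
)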

4. Calling the Llama 3.1 8B deployment API endpoint

Our deployment is now ready to process requests! We can send requests to the deployment using either the deployment-requests-create or batch-deployment-requests-create API endpoint. During the first request a node will be spun up and the model will be downloaded from Hugging Face, which is why this step can take a while. You can monitor the progress in the logs. Subsequent requests to the deployment will be handled faster.

import pprint

data = {"prompt": "Can you tell me a joke?"}

result = api.deployment_requests_create(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=data, timeout=3600
).result
pprint.pprint(result)

Now let's have the LLM respond to itself, and stream the tokens as they are generated.

data={"prompt": result["output"]}

# Create a streaming deployment request
for item in ubiops.utils.stream_deployment_request(
    client=api.api_client,
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    data=data,
    timeout=3600,
    full_response=False,
):
    print(item, end="")

All done! Let's close the client properly.

client.close()

So that's it! You now have your own on-demand, scalable Llama-3.1-8B-Instruct model running in the cloud, with a REST API that you can reach from anywhere!

For any questions, feel free to reach out to us via our customer service portal.