Deploying Llama 3.1 8B Instruct with streaming to UbiOps¶
This notebook shows you how to create a cloud-based inference API endpoint for the Llama-3.1-8B-Instruct model using UbiOps. The model is already pre-trained and will be loaded from the `unsloth` repository on Huggingface. We will use Unsloth's 4-bit quantized version, so that the model can run on an NVIDIA Ada Lovelace L4 GPU.
The Meta Llama 3.1 models are a collection of pre-trained and instruction-tuned generative text models, in 8B, 70B and 405B parameter sizes. The instruction-tuned versions of these models are optimized for dialogue use cases. Meta claims that Llama 3.1 8B outperforms models of a similar size, like Mistral 7B and Gemma 7B, on common industry benchmarks. The model deployed in this tutorial is the instruction-tuned version of the 8B model. We also optimize inference speed using the Flash Attention library.
In this notebook, we will walk you through:
- Connecting with the UbiOps API client
- Creating a code environment for our deployment
- Creating a deployment for the Llama-3.1-8B-Instruct model
- Calling the Llama 3.1 deployment endpoint and streaming the response
Llama 3.1 is a text-to-text model. Therefore we will make a deployment that takes a text prompt as input, and returns a response.
The deployment will return the tokens generated by the model in the `output` field. By making streaming callbacks, we can stream the tokens as they are generated. We will also return `input`, which contains the user's prompt and the `system_prompt`, and `used_config`, which specifies the token generation parameters.

Default pre-set values will be used for the `system_prompt` and `config`. You can find out how to change these by checking the `__init__` of the deployment.
| Deployment input & output variables | Variable name | Data type |
|---|---|---|
| Input fields | prompt | string |
| Output fields | output | string |
| | input | string |
| | used_config | dictionary |
Note that we deploy to a GPU instance by default, which is not available in every project. You can contact us if you would like to enable GPU instances for your project.
Let's start coding!
1. Set up a connection with the UbiOps API client¶
To use the UbiOps API from our notebook, we need to install the UbiOps Python Client Library first:
%pip install -qU ubiops
Now we can set up a connection with your UbiOps environment. To do this, we will need the name of your UbiOps project and an API token with `project_editor` permissions.
You can paste your project name and API token in the code block below before running.
import ubiops
from datetime import datetime
API_TOKEN = "<INSERT API_TOKEN WITH PROJECT EDITOR RIGHTS>" # Make sure this is in the format "Token token-code"
PROJECT_NAME = "<INSERT PROJECT NAME IN YOUR ACCOUNT>"
HF_TOKEN = "<ENTER YOUR HF TOKEN WITH ACCESS TO LLAMA REPO HERE>" # We need this token to download the model from Huggingface
DEPLOYMENT_NAME = f"llama-3-1-8b-{datetime.now().date()}"
DEPLOYMENT_VERSION = "v1"
# Initialize client library
configuration = ubiops.Configuration(host="https://api.ubiops.com/v2.1")
configuration.api_key["Authorization"] = API_TOKEN
# Establish a connection
client = ubiops.ApiClient(configuration)
api = ubiops.CoreApi(client)
print(api.projects_get(PROJECT_NAME))
2. Setting up the environment¶
The environment that our model runs in can be managed separately. To do this we need to select a base environment, to which we will add additional dependencies.
environment_dir = "environment_package"
ENVIRONMENT_NAME = "llama-3-1-environment"
%mkdir {environment_dir}
We will define the Python packages required to run the model in a `requirements.txt`, which we will later upload to UbiOps.
%%writefile {environment_dir}/requirements.txt
# This file contains package requirements for the environment
# Installed via PIP.
torch==2.3.0+cu121
huggingface-hub==0.24.1
transformers==4.43.2
accelerate==0.33.0
bitsandbytes==0.43.2
scipy==1.14.0
diffusers==0.29.2
safetensors==0.4.3
ninja==1.11.1.1
ipywidgets==8.1.3
sentencepiece==0.2.0
https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.2/flash_attn-2.6.2+cu123torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
Now we will create a `ubiops.yaml` to set a remote pip index. This ensures that we will install a CUDA-compatible version of PyTorch. CUDA allows models to be loaded and to run on GPUs.
%%writefile {environment_dir}/ubiops.yaml
environment_variables:
- PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu121
apt:
  packages:
    - git
Now we create a custom environment on UbiOps. We select `Ubuntu 22.04 + Python 3.10 + CUDA 12.3.2` as the `base_environment`, and add the additional dependencies we defined earlier on top of it to create a custom environment called `llama-3-1-environment`.
api_response = api.environments_create(
    project_name=PROJECT_NAME,
    data=ubiops.EnvironmentCreate(
        name=ENVIRONMENT_NAME,
        base_environment="ubuntu22-04-python3-10-cuda12-3-2",
        description="Environment to run Llama-3.1 from Huggingface",
    ),
)
Package and upload the environment files.
import shutil
environment_archive = shutil.make_archive(environment_dir, "zip", ".", environment_dir)
api.environment_revisions_file_upload(
    project_name=PROJECT_NAME,
    environment_name=ENVIRONMENT_NAME,
    file=environment_archive,
)
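Building the environment can take a while. Optionally, you can wait until the build has finished before moving on. The snippet below is a sketch: it assumes your version of the UbiOps client library includes the `ubiops.utils.wait_for_environment` helper (similar to the `wait_for_deployment_version` helper we use later); otherwise, you can simply follow the build in the logs in the UbiOps WebApp.
# Optional: wait until the environment has finished building.
# Assumes ubiops.utils.wait_for_environment is available in your client version.
ubiops.utils.wait_for_environment(
    client=api.api_client,
    project_name=PROJECT_NAME,
    environment_name=ENVIRONMENT_NAME,
    stream_logs=True,
)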
3. Creating a deployment for the Llama 3.1 8B Instruct model¶
With the environment set up, we can start writing the code to run the Llama-3.1-8B model, and push it to UbiOps.
We will create a `deployment.py` with a `Deployment` class, which has two methods:
- `__init__`, which runs when the deployment starts up. This method can be used to load models, data artefacts and other requirements for inference.
- `request()`, which runs every time a call is made to the model's REST API endpoint and contains all the logic for processing data.
Separating the logic between the two methods ensures fast model response times: the model is loaded in the `__init__` method, while the code that needs to run when a call is made to the deployment goes in the `request()` method. This way the model only needs to be loaded when the deployment starts up.

As mentioned in the introduction, we will use default pre-set values for the `system_prompt` and generation `config`.
deployment_code_dir = "deployment_code"
!mkdir {deployment_code_dir}
%%writefile {deployment_code_dir}/deployment.py
import json
import os
from threading import Thread

import torch
from huggingface_hub import login
from transformers import (
    LlamaForCausalLM,
    AutoTokenizer,
    TextIteratorStreamer,
)


class Deployment:
    def __init__(self, base_directory, context):
        """
        Initialisation method for the deployment. Any code inside this method will execute when the deployment starts up.
        It can, for example, be used for loading modules that have to be stored in memory or setting up connections.
        """
        print("Initialising deployment")

        # Read model-related environment variables
        LLAMA_VERSION = os.environ.get("LLAMA_VERSION", "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit")
        self.MAX_RESPONSE_LENGTH = int(os.environ.get("MAX_RESPONSE_LENGTH", 256))

        # Log in to Hugging Face
        HF_TOKEN = os.environ["HF_TOKEN"]
        login(token=HF_TOKEN)

        print("Downloading tokenizer")
        self.tokenizer = AutoTokenizer.from_pretrained(LLAMA_VERSION)
        print("Downloaded tokenizer")

        self.model = LlamaForCausalLM.from_pretrained(
            LLAMA_VERSION,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            use_safetensors=True,
            attn_implementation="flash_attention_2",
        )

        # Default token generation parameters, returned as `used_config`
        self.default_config = {
            "do_sample": True,
            "max_new_tokens": self.MAX_RESPONSE_LENGTH,
            "eos_token_id": 128009,
            "temperature": 0.7,
        }
        self.system_prompt = os.environ.get(
            "SYSTEM_PROMPT",
            "You are a helpful assistant. Please respond to the user's query.",
        )

    def request(self, data, context):
        """
        Method for deployment requests, called separately for each individual request.
        """
        print("Processing request")

        # Prepare the prompt with the system message and user input
        user_prompt = data["prompt"]
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_prompt},
        ]

        # Apply the chat template and tokenize the prompt
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to(torch.device("cuda"))

        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True)
        streaming_callback = context["streaming_update"]
        generation_kwargs = dict(inputs, streamer=streamer, **self.default_config)

        # Start the generation process in a separate thread
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        # Stream the generated text back to the client as it is produced
        response = ""
        for new_text in streamer:
            new_text = new_text.replace("<|eot_id|>", "").encode("unicode_escape").decode("utf-8")
            streaming_callback(new_text)
            response += new_text
        thread.join()

        # Return the final response; the `input` output field is a string, so serialise the messages
        return {"output": response, "input": json.dumps(messages), "used_config": self.default_config}
Create a UbiOps deployment¶
Now we can create the deployment, where we define the inputs and outputs of the model. Each deployment can have multiple versions, and each version can use different deployment code, a different environment, a different instance type, and other settings.
# Create the deployment
deployment_template = ubiops.DeploymentCreate(
    name=DEPLOYMENT_NAME,
    input_type="structured",
    output_type="structured",
    input_fields=[
        {"name": "prompt", "data_type": "string"},
    ],
    output_fields=[
        {"name": "output", "data_type": "string"},
        {"name": "input", "data_type": "string"},
        {"name": "used_config", "data_type": "dict"},
    ],
)
api.deployments_create(project_name=PROJECT_NAME, data=deployment_template)
Create a deployment version¶
Now we will create a version of the deployment. For the version, we need to define the name, the environment, and the instance type (CPU or GPU), as well as the size of the instance.
For this model it is recommended to use a GPU instance.
# Create the version
version_template = ubiops.DeploymentVersionCreate(
    version=DEPLOYMENT_VERSION,
    environment=ENVIRONMENT_NAME,
    instance_type_group_name="16384 MB + 4 vCPU + NVIDIA Ada Lovelace L4",
    maximum_instances=1,
    minimum_instances=0,
    maximum_idle_time=600,  # = 10 minutes
    request_retention_mode="full",
)

api.deployment_versions_create(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=version_template
)
Package and upload the code:
import shutil
deployment_code_archive = shutil.make_archive(
    deployment_code_dir, "zip", deployment_code_dir
)

upload_response = api.revisions_file_upload(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    file=deployment_code_archive,
)
print(upload_response)
# Check if the deployment is finished building. This can take a few minutes
ubiops.utils.wait_for_deployment_version(
    client=api.api_client,
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    revision_id=upload_response.revision,
    stream_logs=True,
)
Before we can send requests to our deployment version, the environment has to finish building. Note that building the environment might take a while, as UbiOps needs to download and install all the packages and dependencies. The environment only needs to be built once; the next time an instance is spun up for our deployment, the dependencies do not have to be installed again. You can toggle off `stream_logs` if you do not want to stream the logs of the build process.
Create an environment variable¶
Here we create an environment variable for our Huggingface token, which the deployment uses to download the model from Huggingface.
api_response = api.deployment_version_environment_variables_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    data=ubiops.EnvironmentVariableCreate(name="HF_TOKEN", value=HF_TOKEN, secret=True),
)
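The deployment also reads the optional `SYSTEM_PROMPT`, `MAX_RESPONSE_LENGTH` and `LLAMA_VERSION` environment variables in its `__init__`. If you want to override the defaults, you can create these in the same way; the values below are only example values, not required settings.
# Optional: override the defaults that the deployment reads in its __init__.
# The values below are example values, adjust them to your needs.
for name, value in [
    ("SYSTEM_PROMPT", "You are a helpful assistant that answers concisely."),
    ("MAX_RESPONSE_LENGTH", "512"),
]:
    api.deployment_version_environment_variables_create(
        project_name=PROJECT_NAME,
        deployment_name=DEPLOYMENT_NAME,
        version=DEPLOYMENT_VERSION,
        data=ubiops.EnvironmentVariableCreate(name=name, value=value, secret=False),
    )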
4. Calling the Llama 3.1 8B deployment API endpoint¶
Our deployment is now ready to process requests! We can send requests to the deployment using either the `deployment-requests-create` or the `batch-deployment-requests-create` API endpoint. During the first request a node will be spun up and the model will be downloaded from Huggingface, which is why this step can take a while. You can monitor the progress in the logs. Subsequent requests to the deployment will be handled faster.
import pprint
data = {"prompt": "Can you tell me a joke?"}
result = api.deployment_requests_create(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=data, timeout=3600
).result
pprint.pprint(result)
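The `deployment_requests_create` call above blocks until the request has finished. As mentioned above, you can also use the batch endpoint, which returns immediately and lets you fetch the result later. Below is a minimal sketch, assuming the standard client methods `batch_deployment_requests_create` and `deployment_requests_get`:
# Alternative: create the request asynchronously and fetch the result later.
# Assumes the standard batch_deployment_requests_create / deployment_requests_get client methods.
batch_response = api.batch_deployment_requests_create(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=[data]
)
request_id = batch_response[0].id

# Poll this later; the status will be "completed" once the result is available
request_details = api.deployment_requests_get(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, request_id=request_id
)
print(request_details.status)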
Now let's have the LLM respond to itself, and stream the tokens as they are generated.
data={"prompt": result["output"]}
# Create a streaming deployment request
for item in ubiops.utils.stream_deployment_request(
client=api.api_client,
project_name=PROJECT_NAME,
deployment_name=DEPLOYMENT_NAME,
version=DEPLOYMENT_VERSION,
data=data,
timeout=3600,
full_response=False,
):
print(item, end="")
All done! Let's close the client properly.
client.close()
So that's it! You now have your own on-demand, scalable `Llama-3.1-8B-Instruct` model running in the cloud, with a REST API that you can reach from anywhere!
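Because the deployment is exposed as a plain REST API, you can also call it without the UbiOps client library, for example with the `requests` package. This is only a sketch; it assumes the request endpoint follows the `deployment-requests-create` path layout on the API host we connected to earlier.
import requests

# Sketch: call the deployment's REST endpoint directly.
# Assumes the standard /projects/{project}/deployments/{deployment}/requests path.
url = f"https://api.ubiops.com/v2.1/projects/{PROJECT_NAME}/deployments/{DEPLOYMENT_NAME}/requests"
headers = {"Authorization": API_TOKEN}
response = requests.post(url, headers=headers, json={"prompt": "Can you tell me a joke?"}, timeout=3600)
print(response.json())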
For any questions, feel free to reach out to us via our customer service portal