Deploy Llama 3.1 8B Instruct on UbiOps

In this guide, we’ll walk you through MetaAI’s latest release, Llama 3.1. This update brings changes to the existing Llama 3 8B & 70B models and introduces a new model with 405B parameters (Llama 3.1 405B). We’ll also deploy a quantized version of the Llama 3.1 8B Instruct model to UbiOps.

What can you get out of this guide?

In this guide, we’ll explain how you can deploy Llama 3.1 8B Instruct on UbiOps in 4 steps:

  1. Create a UbiOps account
  2. Accept Meta’s license agreement and retrieve your HuggingFace token
  3. Create your Llama 3.1 8B Instruct deployment & deployment version
  4. Make an API call to Llama 3.1 8B Instruct

To successfully complete this guide, make sure you have:

  • A UbiOps account with access to GPU instances
  • A HuggingFace account, so you can accept Meta’s license agreement and generate an access token

You’ll also need the following files, which together make up the deployment package we’ll provide in this guide:

  • The deployment file
  • A requirements.txt file
  • A ubiops.yaml file

What is Llama 3.1?

Updates to Llama 8B & 70B versions

MetaAI recently released an updated version of Llama 3, named Llama 3.1. This update changes their previously released Llama 3 8B & 70B models: the context window grows from 8k tokens to 128k tokens, and support is added for eight languages. Furthermore, the overall reasoning capabilities of the models have been improved, and they now offer state-of-the-art tool use. This enables the updated versions to be used for advanced use cases, such as:

  • Long-form text summarization
  • Coding assistants
  • Multilingual conversational agents

Meta released the following table that shows how the updated version of Llama 3.1 compares to models that have a similar number of parameters:

Source: meta-llama-3-1

Release of Llama 3.1 405B

The update also saw the release of Llama 3.1 405B. Meta states that this new version ushers in a new era of GenAI, with open source leading the way, and that it is the first openly available model that can rival the top AI models on:

  • General knowledge
  • Math
  • Steerability
  • Tool use
  • Multilingual translation

Here’s a table that compares the 405B model with some of its top (proprietary) rivals:

Source: meta-llama-3-1

You can see in the table above that Llama 3.1 405B scores highest on most benchmarks and comes close to the other (closed) models on the rest. Note, though, that these results should be taken with a grain of salt, since they come from MetaAI itself.

Lastly, MetaAI also changed its license: it is now allowed to use the output(s) of Llama models to improve other models.

What is UbiOps?

For this blog post, we’ll be deploying the updated 8B Instruct model on UbiOps, a platform where you can deploy AI products, like Stable Diffusion or the newly released Llama 3.1, to a production environment. In the background, UbiOps ensures that your model is always available (with an uptime of 99.99%) and takes care of auto-scaling. UbiOps also lets you run your model on state-of-the-art hardware for fast inference times. Working with private data isn’t a problem either: UbiOps offers on-premise, hybrid, and cloud solutions to run your model on.

How to deploy Llama 3.1 8B Instruct on UbiOps

The first step is to create a free UbiOps account. You can simply use your email address to sign up and within a few clicks, you’ll be good to go. Note that your account needs to have access to GPUs to run this model.

UbiOps works with an organizational structure. Each organization can contain one or more projects. Within these projects, you can deploy your containerized code; on UbiOps these are called deployments. You can also orchestrate your workflow by chaining deployments together into a pipeline.

Create a project

Head over to the UbiOps WebApp and click on “Create new project”. You can let UbiOps generate a name for your project, or choose your own unique name.

Now that we have our project up and running we can start deploying our model.

Create your Llama 3.1 8B Instruct deployment

Navigate to the “Deployments” tab on the left-hand side and click on “+Create your first deployment” if it’s your first deployment, or on “+ Create” on the top right if you already have a deployment inside your project.

You’ll then be prompted to define the name of the deployment, as well as its input(s) and output(s). For the name, you can pick anything you like; in this example, we’ll name our deployment llama-3-1-8b-instruct.

The input & output fields of the deployment define what data the deployment expects when running the model, i.e., when making a request. We’ll define three inputs for this deployment:

  • prompt: the user’s prompt
  • system_prompt: the system prompt that steers the model’s behaviour
  • config: a dictionary of (optional) generation parameters

You can use the following parameters for the input fields:

Name            Data type
prompt          String
system_prompt   String
config          Dictionary

For the output fields, you can use the following parameters:

Name            Data type
output          String
input           String
used_config     Dictionary
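
If you prefer working from code instead of the WebApp, the same deployment can be created with the UbiOps Python client. The sketch below assumes a project named llama-project and a UbiOps API token; class and field names come from the ubiops client library, so double-check them against the client reference for your version.

```python
# Minimal sketch: creating the deployment through the UbiOps Python client
# instead of the WebApp. The API token and project name are placeholders.
import ubiops

configuration = ubiops.Configuration()
configuration.api_key["Authorization"] = "Token <YOUR_UBIOPS_API_TOKEN>"  # placeholder
configuration.host = "https://api.ubiops.com/v2.1"

client = ubiops.ApiClient(configuration)
api = ubiops.CoreApi(client)

deployment_template = ubiops.DeploymentCreate(
    name="llama-3-1-8b-instruct",
    input_type="structured",
    output_type="structured",
    input_fields=[
        ubiops.DeploymentInputFieldCreate(name="prompt", data_type="string"),
        ubiops.DeploymentInputFieldCreate(name="system_prompt", data_type="string"),
        ubiops.DeploymentInputFieldCreate(name="config", data_type="dict"),
    ],
    output_fields=[
        ubiops.DeploymentOutputFieldCreate(name="output", data_type="string"),
        ubiops.DeploymentOutputFieldCreate(name="input", data_type="string"),
        ubiops.DeploymentOutputFieldCreate(name="used_config", data_type="dict"),
    ],
)

api.deployments_create(project_name="llama-project", data=deployment_template)
client.close()
```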

After filling in the parameters shown above, UbiOps will generate deployment code snippets that can be downloaded or copied. These snippets can be used to create your deployment package. You can ignore these for this guide, as we’ll be providing you with the deployment package.

Now you can scroll down and click on “Next: Create a version” to finish building your deployment.

Create a deployment version

Upload the Llama 3.1 8B Instruct deployment package, which contains three files:

  • The deployment file, which contains the code that retrieves the model from HuggingFace, loads it onto the specified instance type, and defines how it handles requests, i.e., prompts (a minimal sketch of its structure is shown below this list).
  • The requirements.txt file, which specifies the additional dependencies we need to run the model.
  • A ubiops.yaml file, which we’ll use to download a specific CUDA version.
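
To give you an idea of what the deployment file looks like, here is a minimal sketch of the structure UbiOps expects: a Deployment class with an __init__ method that loads the model once, and a request method that handles each prompt. The actual package we provide uses a quantized version of the model and more elaborate generation settings, so treat the values below as illustrative.

```python
# deployment.py - minimal sketch of a UbiOps deployment for Llama 3.1 8B Instruct.
# The HF_TOKEN environment variable is created later in this guide.
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"


class Deployment:
    def __init__(self, base_directory, context):
        # Runs once when the instance starts (the cold start): download the
        # gated model from HuggingFace and load it onto the GPU.
        token = os.environ["HF_TOKEN"]
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=token)
        self.model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, token=token, torch_dtype=torch.bfloat16, device_map="auto"
        )

    def request(self, data):
        # Runs for every request made to the deployment.
        config = data.get("config") or {}
        messages = [
            {"role": "system", "content": data.get("system_prompt") or "You are a helpful assistant."},
            {"role": "user", "content": data["prompt"]},
        ]
        input_ids = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)
        output_ids = self.model.generate(input_ids, max_new_tokens=config.get("max_new_tokens", 256))
        response = self.tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
        return {"output": response, "input": data["prompt"], "used_config": config}
```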

For the base environment, select “Ubuntu 22.04 + Python 3.10 + CUDA 12.3.2”; this is the environment to which the additional dependencies in requirements.txt will be added. Together, the base environment and the additional dependencies form a custom environment in which the model will run. You can also create a custom environment explicitly first, as shown in the Deploy Mistral guide; for this blog post, we created the environment implicitly.

We’ll need to run the model on a GPU, so toggle “Enable accelerated hardware” and select the 16384MB + 4 vCPU + NVIDIA Ada Lovelace L4 instance. Now you can scroll down and click on “Create”, after which UbiOps will start building your deployment version!
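
The version can also be created, and the package uploaded, through the UbiOps Python client. Note that the environment code and instance type identifier below are assumptions based on the names shown in the WebApp; look up the exact values available in your project before using them.

```python
# Sketch: creating the deployment version and uploading the deployment package
# via the UbiOps Python client. Identifiers marked as assumed should be checked
# against the environments and instance types listed in your project.
import ubiops

configuration = ubiops.Configuration()
configuration.api_key["Authorization"] = "Token <YOUR_UBIOPS_API_TOKEN>"  # placeholder
configuration.host = "https://api.ubiops.com/v2.1"
api = ubiops.CoreApi(ubiops.ApiClient(configuration))

version_template = ubiops.DeploymentVersionCreate(
    version="v1",
    environment="ubuntu22-04-python3-10-cuda12-3-2",  # assumed code for the base environment
    instance_type="16384mb_l4",                       # assumed code for the L4 GPU instance
    minimum_instances=0,
    maximum_instances=1,
)
api.deployment_versions_create(
    project_name="llama-project",                     # placeholder
    deployment_name="llama-3-1-8b-instruct",
    data=version_template,
)

# Upload the zipped deployment package, after which UbiOps starts building the version.
api.revisions_file_upload(
    project_name="llama-project",
    deployment_name="llama-3-1-8b-instruct",
    version="v1",
    file="deployment_package.zip",                    # placeholder path to the package
)
```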

Retrieve your Llama 3.1 token & create an environment variable

Llama 3.1 8B Instruct is behind a gated repo, so to be able to use it we’ll need to sign in to HuggingFace and accept Meta’s license agreement on the Meta-Llama-3.1-8B-Instruct page. Note that even if you already have access to Llama 3, you’ll need to accept the user agreement again to be able to use any of the Llama 3.1 versions.

After accepting, navigate within HuggingFace to Settings->Access tokens->New Token and generate a token with the “read” permission. Copy this token to your clipboard.

Now navigate back to your deployments and head over to the Environment Variables tab. Click on “+Create variable”, name the variable “HF_TOKEN” and paste the HuggingFace token you just created as the value. 

Mark the environment variable as a Secret, click on the check mark to save it, and you’ll be good to go!
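
For completeness, here is how the same secret environment variable could be created with the UbiOps Python client; the project name and both tokens are placeholders.

```python
# Sketch: creating the secret HF_TOKEN environment variable on the deployment
# with the UbiOps Python client instead of the WebApp.
import ubiops

configuration = ubiops.Configuration()
configuration.api_key["Authorization"] = "Token <YOUR_UBIOPS_API_TOKEN>"  # placeholder
configuration.host = "https://api.ubiops.com/v2.1"
api = ubiops.CoreApi(ubiops.ApiClient(configuration))

api.deployment_environment_variables_create(
    project_name="llama-project",                     # placeholder
    deployment_name="llama-3-1-8b-instruct",
    data=ubiops.EnvironmentVariableCreate(
        name="HF_TOKEN",
        value="<YOUR_HUGGINGFACE_TOKEN>",             # the token copied from HuggingFace
        secret=True,                                  # hide the value in the WebApp
    ),
)
```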

How to run the Llama 3.1 8B Instruct model on UbiOps

After UbiOps has finished building the deployment version, the model is ready to serve inference requests. Click on “Create Request” and enter your prompt in the input fields. Here’s an example:
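
The same request can also be made programmatically with the UbiOps Python client. The prompt and config values below are just an example; which config keys are honoured depends on the deployment code.

```python
# Example request to the deployment, made with the UbiOps Python client.
# Note that the first request (the cold start) can take several minutes.
import ubiops

configuration = ubiops.Configuration()
configuration.api_key["Authorization"] = "Token <YOUR_UBIOPS_API_TOKEN>"  # placeholder
configuration.host = "https://api.ubiops.com/v2.1"
api = ubiops.CoreApi(ubiops.ApiClient(configuration))

request = api.deployment_requests_create(
    project_name="llama-project",                     # placeholder
    deployment_name="llama-3-1-8b-instruct",
    data={
        "prompt": "Explain in two sentences what UbiOps is.",
        "system_prompt": "You are a friendly assistant.",
        "config": {"max_new_tokens": 256},
    },
)
print(request.result["output"])
```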

Note that the first request takes significantly longer than subsequent requests. This is called a cold start: the model first has to be loaded onto the hardware and set up before it can serve inferences, and this additional setup time, compared to just handling a request, can take between 1 and 10 minutes. Once the request is finished, you can have a look at Llama 3.1’s response:

Conclusion 

We have now successfully deployed the new Llama 3.1 8B Instruct on UbiOps. We showed you how to:

  • Create a custom environment implicitly
  • Create a deployment & deployment version
  • Create an environment variable
  • Make requests to your deployment version’s endpoint

If you’re interested in how you can deploy other models to UbiOps or want more information about optimizing LLMs when they’re in production, have a look at some of the blog posts we released earlier.

If you’re curious about how UbiOps can help you with deploying and training your AI models, don’t hesitate to get in touch so we can have a chat about what we can do for you and your organization.
