What can you get out of this guide?
In this guide, we explain how to:
- Create a UbiOps account
- Create a code environment
- Accept Meta’s license agreement and retrieve your Hugging Face token
- Create your Llama 3 8B instruct deployment and deployment version
- Make an API call to Llama 3 8B instruct!
To successfully complete this guide, make sure you have:
You’ll also need the following files:
What is Llama 3 8B?
Llama 3 is the most recent model of the Llama series developed by Meta. It comes in two sizes, the compact 8 billion parameter version and the larger 70 billion parameter version. Along with this, Meta released Meta AI, a competitor to ChatGPT, built on top of Llama 3 models. Keep in mind that as of writing, Meta AI is not available in western Europe. Both models come with base versions and instruct versions.
What is special about the instruct version of Llama 3? The instruct version designates that it was instruction tuned. Instruction tuning is a process in which the base model is trained to respond to questions and respond in a conversational manner in contrast to the base model which will simply complete prompts. Therefore, unless you are going to fine-tune Llama 3, we recommend using the instruct version.
Llama 3 8B is very performant, according to the evaluation results released by Meta, it outperforms the similarly sized Mistral 7B and Gemma 7B models. Here is a table of some of the results:
MMLU 5-shot (% correct) | GPQA 0-shot (% correct) | HumanEval 0-shot (% correct) | |
Mistral 7B | 58.4 | 26.3 | 36.6 |
Gemma 7B | 53.3 | 21.4 | 30.5 |
Llama 3 8B | 68.4 | 32.4 | 62.2 |
As you can see, Llama 3 8B has state-of-the-art performance results. Taking a look at the results from Hugging Face’s LLM leaderboard:
Note: We took the results for the instruction-tuned version of each model
MMLU 5-shot (% correct) | ARC 25-shot (% correct) | WinoGrande 5-shot (% correct) | |
Llama 3 8B | 67.07 | 60.75 | 74.51 |
Gemma 7B | 53.52 | 51.45 | 67.96 |
Mistral 7B | 60.78 | 63.14 | 77.19 |
Again, these results are impressive. However, less so than those released by Meta. The model seems to be the most performant or roughly tied with the most performant within the 7–8B category. We have guides on how to deploy Gemma 7B and Mistral 7B on UbiOps.
In this guide we will be using the Llama 3 8B instruct version.
What is UbiOps?
UbiOps is a model serving and management platform which is perfect for managing fine-tuning tasks, serving models via a REST API endpoint and managing large-scale AI applications. We offer on-premise installation options as well as offering hardware instances. We offer tools like Pipelines which help you manage the many input and output datastreams in your AI application, you can connect those streams to models and perform logical operations on them. We also offer extensive monitoring, logging and event auditing capabilities. With these tools, you can monitor, evaluate and respond to potential errors in your deployment. Overall, UbiOps is a powerful AI serving and orchestration tool which is a useful addition to any machine learning tech stack.
How to deploy Llama 3 8B on UbiOps
The first step is to create a free UbiOps account. Simply sign up with an email address and within a few clicks you will be good to go. For this guide you will need to request GPU access. A free account isn’t powerful enough to run Llama 3 8B.
In UbiOps you work within an organization, which can contain one or more projects. Within these projects you can create multiple deployments, which are basically your containerized code. You can also chain together deployments to create a pipeline.
Create a project
Head over to the UbiOps WebApp and click on “Create new project”. You can give your project a unique name, or let UbiOps generate one for you.
Now that we have our project up and running we can start building the environment that our code will run in.
Create a code environment
See our documentation page for more information on environments.
A custom environment allows you to add custom dependencies to your deployment. You can create custom environments either explicitly or implicitly. Explicit environment creation means creating an environment in the Environments tab. Implicit environment creation means adding the environment files to your deployment package, which prompts UbiOps to create a custom environment automatically.
For this guide we’ll be building our environment explicitly. You can do this by going to the “Environments” tab on the left hand side, and clicking on “+ Custom environment”. Then, fill in the following parameters:
Name | llama-3-env |
Base environment | Ubuntu 22.04 + Python 3.10 + CUDA 11.7.1 |
Custom dependencies | Upload the environment package |
Then click on the “Create” button below and UbiOps will start building your environment (this should only take a few minutes).
Create your Llama 3 8B-Instruct deployment
Now you can navigate to the “Deployments” tab on the left and click on “Create”. In the following menu, you can define the name of the deployment as well as its input(s) and output(s). The input and output fields of the deployment define what data the deployment expects when making a request (i.e. when running the model). For this guide you can use the following:
Input Parameters:
Name | Data Type |
prompt | String |
system_prompt | String |
config | Dictionary |
system_prompt is a field which allows you to give extra instructions to the model. The config dictionary allows you to define extra configurations such as max_new_tokens and temperature. Click here for a full list of parameters and their description.
Output parameters:
Name | Data Type |
output | String |
input | String |
used_config | Dictionary |
After providing that information, UbiOps will generate Python and R deployment code snippets that can be downloaded or copied. These are used to create the deployment package. However, for this guide, you can ignore these.
To finish creating your deployment, click “Next: Create a version”.
Create a deployment version
Upload the Llama 3 8B-Instruct deployment file, which contains the code which retrieves the Llama 3 model from HuggingFace, loads it onto the hardware and configures how it should respond to prompts. In the environment settings, select llama-3-env from the “Select code environment” dropdown menu. Note:the environment should have finished building before you can select it.
Then, select the hardware the model will run on. For this deployment, you’ll need to Enable accelerated hardware and select the 16384MB + 4 vCPU + NVIDIA Ada Lovelace L4 instance. When you are happy with your settings, click on “Create” and UbiOps will get straight to work building your deployment version.
Retrieve your Llama 3 token and create an environment variable
To be able to use Llama 3 8B instruct, you will need to sign into Hugging Face and accept Meta’s license agreement on the Meta-Llama-3-8B-Instruct page.
After accepting, within Hugging Face, go to Settings->Access Tokens->New Token and generate a new token with the “read” permission. Copy the token to your clipboard.
Now navigate to your deployment and to the Environment variables tab. Click on “Create variable”. Name the variable “HF_TOKEN” and paste your Hugging Face token as the value. Make sure to make it Secret.
Now save it by clicking on the checkmark. Now you can make inference requests to Llama 3 8b Instruct!
How to run the Llama 3 8B Instruct model on UbiOps
Now we will show you how to make inference requests. Click on “Create Request” and enter your prompt within the input fields. Here is an example:
Keep in mind that the first request will take significantly longer than the next ones. This is because the first request will require a cold start. A cold start is the time it takes for the model to be loaded onto the hardware and get set up properly to make inferences. This can take 1–5 minutes. Once it has completed, you can view Llama 3’s response:
Conclusion
We have now successfully deployed Llama 3 Instruct on UbiOps. We showed you how to create a custom environment explicitly, create a deployment and deployment version, create an environment variable and make requests to your deployment version’s endpoint. All of this can be done in under 15 minutes—if you’re fast.
If you are interested in learning more about UbiOps’s functionalities or general information about LLMs, check out our guide on which LLM to choose for your use case, how to create a custom chatbot fine-tuned on your documentation or how to deploy LLaMa 2 with a customizable front-end (with Streamlit).