In this guide, we will take you through the newly released Phi-3.5 Small Language Models (SLMs), which consist of updates to Microsoft’s two existing Phi-3 SLMs, Phi-3.5-mini & Phi-3.5-vision, and the release of a new Phi model: the Phi-3.5 Mixture-of-Experts (MoE). We’ll also deploy the newly updated Phi-3.5-mini-instruct to UbiOps using UbiOps’s Command Line Interface (CLI).
What can you get out of this guide?
In this guide we will explain how you can deploy Phi 3.5 mini to UbiOps in three steps:
1. Create a UbiOps account & install the UbiOps CLI
2. Create a UbiOps deployment & deployment version
3. Make an API call to the Phi-3.5-mini-instruct
In order to follow along with this guide, make sure you have Python installed and a terminal available.
You will also need to download the following zip file:
The zip contains all the files we need to create and run Phi 3.5 on UbiOps:
- Two .yaml files which specify the parameters for the deployment & deployment version: deployment_config.yaml & version_config.yaml
- The deployment package, which contains the code that we will push to UbiOps:
deployment.py, requirements.txt, ubiops.yaml
- A .json file called data.json, which contains data we can use when making a request to the model
What are the Phi-3.5 SLMs?
The Phi-3.5 models are the latest additions to Microsoft’s Phi-3 family of SLMs, which was released in April 2024. The Phi-3.5 models, released in August 2024, consist of updates to two existing models, Phi-3.5-mini & Phi-3.5-vision, and the release of a whole new model: the Phi-3.5-MoE (Mixture of Experts).
All models can be downloaded from Microsoft’s repository on Hugging Face.
The Phi-3.5-MoE features 16 small experts and is claimed to deliver reduced latency & high-quality performance. Like the Phi-3.5-mini, the model supports a 128K context length and multilingual capabilities. Microsoft claims that this new model, which has 6.6B active parameters, performs better than larger models:
Source: Microsoft
Phi-3.5-mini
Microsoft states that they subjected the Phi-3.5-mini to further pre-training, making use of high-quality multilingual & synthetic data. The pre-training phase was then followed by a series of post-training steps using different fine-tuning techniques, like Supervised Fine-Tuning and Direct Preference Optimization.
Microsoft further states that the extra pre-training and post-training resulted in a substantial increase in the model’s multilingual, multi-turn conversation, and reasoning capabilities. The new version was also trained on a select set of languages, like Dutch, Arabic, Chinese, and German, which resulted in an improvement of 25-50% on some languages compared to Phi-3-mini.
With the context length being increased to 128k tokens (which is the same as the newly released Llama 3.1 model), Phi-3.5-mini is now also suited for use cases like summarization of long documents, information retrieval and meeting transcripts.
This allows the relatively small model (3.8B parameters) to keep up with, or even surpass, rival models of larger size, as shown in the table below:
Source: Microsoft
Phi-3.5-vision
Microsoft used customer feedback to improve Phi-3-vision, which enabled the release of Phi-3.5-vision. With the upgrade, the model now possesses cutting-edge capabilities for multi-frame image understanding & reasoning, according to Microsoft. This means the model can now be used for a wide array of use cases, including multi-image storytelling & summarization, and video summarization.
Microsoft released the following table, which shows how different models scored on vision tasks benchmarks:
Source: Microsoft
It should be mentioned that although the new Phi-3.5 models are multilingual, Microsoft still recommends making use of a Retrieval-Augmented Generation (RAG) framework, or fine-tuning the model further, if you want to use the model for multilingual use cases.
What is UbiOps?
As mentioned in the introduction, we will be deploying Phi-3.5-mini-instruct to production. For this we will make use of UbiOps, a platform where you can deploy (Gen)AI models, like Llama 3.1 or Stable Diffusion, to a production environment. UbiOps ensures the availability of your model with an uptime of 99.99%, and takes care of autoscaling in the background. UbiOps also gives your model access to state-of-the-art hardware to provide fast inference times. Working with private data is not a problem either: UbiOps offers on-premise, hybrid, and cloud solutions for running your model.
How to deploy Phi-3.5-mini-instruct to UbiOps
The first thing you need to do is create a UbiOps account. You can do this simply by signing up with your email address, and after a few clicks you will be good to go. Note that your account needs access to GPUs to run this model; please contact support if you need GPU access.
UbiOps works with organizations. Within each organization you can have one or more projects, and within these projects you can deploy your containerized code, which UbiOps calls deployments. UbiOps also lets you chain deployments together into pipelines to orchestrate your workflow.
Create a project
Navigate to the UbiOps WebApp and click on “Create new project”. You can choose your own unique name, or let UbiOps generate one for you.
Now that we have our project set up, we can start installing the UbiOps CLI and set the default project.
Install the UbiOps CLI
In your favorite terminal, run the following command to install the UbiOps CLI:
Note: you need to have Python installed for this command to work.
pip install ubiops-cli
Now we can log in to UbiOps from our terminal and set the default project using the CLI. Replace PROJECT_NAME in the code cell below with the name of the project you created earlier.
ubiops signin
ubiops current_project set PROJECT_NAME
Create your Phi-3.5-mini-instruct deployment
To create the deployment for Phi-3.5-mini-instruct, we can use the deployment_config.yaml file we downloaded earlier. This file contains the structure of the deployment we will create on UbiOps and includes the name, the input, and the output of the deployment. By defining the in- & output of the deployment, we instruct UbiOps what type of data to expect when running the model.
As you can see in the tables below, the model expects a prompt (the user’s input), but also a system_prompt (instruction and/or context for the model) and a config (which you can use to influence the SLM’s behavior). The deployment is set up in such a way that you can specify a different system_prompt and configuration (temperature etc.) for every request; an example request payload is shown after the tables below. If the system_prompt and config are left empty when making the request, defaults are used for both.
As output, the deployment returns the generated output (i.e., the response to the user’s prompt & system prompt), the input of the model (system_prompt + prompt), and the used_config that was used to create the output.
The input and output variables for this deployment are configured as follows:
Input configuration:

| Name | Data type |
| --- | --- |
| prompt | String |
| system_prompt | String |
| config | Dictionary |

Output configuration:

| Name | Data type |
| --- | --- |
| output | String |
| input | String |
| used_config | Dictionary |
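To make the per-request overrides mentioned above concrete, here is a sketch of the kind of request data this deployment accepts, written as a Python dictionary (the same structure you will later see in data.json). The keys inside config match the generation parameters used in deployment.py; the values here are purely illustrative.

# Example request data for the phi3-5 deployment (illustrative values only).
# The keys inside "config" match the generation parameters defined in deployment.py.
request_data = {
    "prompt": "Summarize the meeting notes below in three bullet points.",
    "system_prompt": "You are a helpful assistant that answers concisely.",
    "config": {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "do_sample": True,
    },
}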
To create the deployment on UbiOps you can use the following command in your terminal:
ubiops deployments create -f deployment_config.yaml
If you navigate to your UbiOps environment now, you should see a new deployment called “phi3-5” inside your project.
The next step is to deploy the code inside the deployment package to UbiOps, but before that, let us take a closer look at what we are actually deploying to UbiOps.
The deployment package
The deployment package contains three files:
- The deployment.py, which contains the code that is run every time a request is made to your deployment
- The requirements.txt, which contains all the Python packages we need to run the code in the deployment.py.
- A ubiops.yaml, which we use to download a specific CUDA version.
The deployment.py contains a Python class with two methods:
- The __init__, which contains the code that is run when the deployment spins up
- The request, which contains the code that is run every time a request is made to your deployment
The first request made to a deployment always takes longer than subsequent requests. This is because UbiOps needs to download the model onto the hardware we specify later, and set up the deployment to handle inference requests. This is the part of the code that is in the __init__ of the deployment (see below). This process can take anywhere from 1 to 10 minutes. After this is done UbiOps will handle the code that is in the request part of the deployment.py.
The entire deployment.py for this deployment looks like this:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


class Deployment:
    def __init__(self, base_directory, context):
        # Runs once when the deployment spins up: download and load the model
        print("Initialising My Deployment")

        # The model can be overridden via the MODEL_ID environment variable
        model_id = os.environ.get("MODEL_ID", "microsoft/Phi-3.5-mini-instruct")

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="cuda",
            torch_dtype="auto",
            trust_remote_code=True,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

        self.pipe = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
        )

        # Default generation parameters, used when no config is passed in the request
        self.config = {
            "max_new_tokens": 500,
            "return_full_text": False,
            "temperature": 0.0,
            "do_sample": False,
        }

    def request(self, data, context):
        # Runs on every request: build the chat messages and generate a response
        print("Processing request for My Deployment")

        # Fall back to a default system prompt when none is provided
        if data["system_prompt"]:
            system_prompt = data["system_prompt"]
        else:
            system_prompt = "A user is going to ask you a question. Please reply adequately."

        # Update the default config with any values the user added to the request
        config = self.config.copy()
        if data["config"]:
            config.update(data["config"])

        messages = [
            {"role": "system", "content": f"{system_prompt}"},
            {"role": "user", "content": data["prompt"]},
        ]

        response = self.pipe(messages, **config)
        output = response[0]["generated_text"]

        # Return the output fields as defined in deployment_config.yaml
        return {"output": output,
                "input": messages,
                "used_config": config}
You can manage the deployed code and the environment that it runs in separately on UbiOps. Here we will create a custom environment, which consists of a base environment plus the additional dependencies defined in the environment files (the requirements.txt & ubiops.yaml).
Custom environments can be created in two ways:
- Explicitly: by creating an environment in the “Environments” tab in the WebApp (or by using the “ubiops environments create” command)
- Implicitly: by adding the environment files to the deployment package. This is what we will do for this blog post.
The requirements.txt & ubiops.yaml in the deployment package will be used to create a custom environment on UbiOps for running the Phi-3.5-mini-instruct.
Create a deployment version
Now that you know what code you are actually deploying, it is time to push it to UbiOps. We make use of the “ubiops deployments deploy” command to create a version for the deployment. You can create one or more versions of a deployment. Each version shares the same in- & output, but the deployed code, the environment it runs in, the instance type, and other settings can all differ.
The specifications for this deployment version can be found in the version_config.yaml, which is in the .zip file we downloaded earlier. That is where the base environment (to which the additional dependencies in the requirements.txt & ubiops.yaml are added), the instance type, and other settings are specified.
Phi-3.5-mini-instruct needs to run on a GPU, so we select the “16384 MB + 4 vCPU + NVIDIA Ada Lovelace L4” instance type in the version_config.yaml for this version.
To deploy the code to UbiOps, you can run the following command inside your terminal:
ubiops deployments deploy -f version_config.yaml -dir deployment_package
To see whether the previous command completed successfully, you can either navigate to the WebApp and have a look there, or run the following command:
ubiops deployment_versions list -d phi3-5
Make an API call to the Phi-3.5-mini-instruct
Now we use the “ubiops deployments requests create” command to make an API call to the Phi 3.5 deployment. The request data can be found in the data.json file, in the zip file we downloaded before:
{
"prompt": "Tell me a joke",
"system_prompt": "You're a pirate",
"config": {}
}
If you want to make a request using a different system_prompt, you can do so by changing the value of system_prompt in the data.json file.
You can make a request by entering the following command in your terminal:
ubiops deployments requests create phi3-5 -v v-mini-instruct -f data.json
After the request is completed, you can either check your terminal for the model’s response, or navigate to the WebApp and have a look there.
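If you prefer to make requests from code rather than the CLI, you can also use the UbiOps Python client library (pip install ubiops). The sketch below reflects our best recollection of the client API, so treat the method and parameter names as assumptions and check the UbiOps client reference; the API token and project name are placeholders you need to fill in yourself.

# Illustrative sketch using the UbiOps Python client (method names assumed,
# verify against the client documentation). Replace the placeholders before running.
import ubiops

configuration = ubiops.Configuration(
    host="https://api.ubiops.com/v2.1",
    api_key={"Authorization": "Token <YOUR_API_TOKEN>"},
)
api_client = ubiops.ApiClient(configuration)
core_api = ubiops.CoreApi(api_client)

request_data = {
    "prompt": "Tell me a joke",
    "system_prompt": "You're a pirate",
    "config": {"temperature": 0.7, "do_sample": True},
}

# Same request as the CLI call above, targeting the v-mini-instruct version
result = core_api.deployment_version_requests_create(
    project_name="<PROJECT_NAME>",
    deployment_name="phi3-5",
    version="v-mini-instruct",
    data=request_data,
)
print(result.result["output"])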
Conclusion
You have now successfully deployed the new Phi-3.5-mini-instruct to UbiOps using the UbiOps CLI. In this blog post we talked you through the new Phi-3.5 SLMs, and explained how to:
- Create a deployment
- Create a custom environment
- Create a deployment version
- Make a request to your deployment version’s API endpoint
You can also deploy other models to UbiOps. If you are interested in that, or want more information about optimizing models once they are in production, have a look at some blog posts we released earlier:
- Deploy Llama 3.1 Instruct on UbiOps
- How to build a RAG query engine with LlamaIndex and UbiOps
- Fine-tune a model on your documentation
If you are wondering about how UbiOps can help you with deploying and training your AI models, do not hesitate to contact us so we can have a conversation about what we can do for you and your organization.