Deploy Gemma 7B in under 15 minutes with UbiOps

What can you get out of this guide?

In this guide, we explain how to:

Create a UbiOps trial account
Create a code environment
Retrieve your Hugging Face token and accept Google’s license
Create your Gemma deployment
Create a deployment version with a GPU instance type
Make an API call to Gemma 7B!

To successfully complete this guide, make sure you have:

Python 3.10 or higher installed
UbiOps Client Library installed
UbiOps account

You’ll also need the following files which are available in the appendix:

What is Gemma 7B?

Gemma is the latest model series released by Google in February 2024. It comes in two sizes, a 2B version, intended to be run on mobile devices and laptops, and a 7B version, intended to be run on desktop computers and small servers.

How performant is Gemma 7B? According to the numbers released by Google, it performs very well. It surpasses the similarly sized Mistral 7B model and the almost twice as large LLaMa 13B. Gemma is clearly a model series designed to be cost effective. According to Google’s performance benchmarks, it delivers on this aim.

Model	Mistral 7B	LLaMa 2 7B	LLaMa 2 13B	Gemma 7B
Average score (%)	54.0	47.0	52.2	56.4

Performance of Gemma 7B versus Mistral and LLaMa 2 models, source: Gemma technical report

Model	Mistral 7B	Falcon 7B	Falcon 40B	Gemma 7B
Average score (%)	60.97	44.17	58.07	63.75

Performance of Gemma 7B versus Mistral and Falcon models, source: Hugging Face Open LLM leaderboard

Overall, Gemma 7B is a fairly light and very performant model. It is readily available on Hugging Face—given you accept Google’s license agreement.

What is UbiOps?

UbiOps is a powerful AI model serving and orchestration service with unmatched simplicity, speed and scale. UbiOps minimizes DevOps time and costs to run, train and manage AI models, and distributes them on any compute infrastructure at scale. It is built for training, deploying, running and managing production-grade AI in an agile way. It features unique functionality for workflow orchestration (Pipelines), automatic adaptive scaling in hybrid or multi-cloud environments as well as key MLOps features. You can learn more about UbiOps features on our Product page.

How to deploy Gemma 7B on UbiOps

The first step is to create a UbiOps account. Simply sign up with an email address and within a few clicks you will be good to go.

In UbiOps you work within an organization, which can contain one or more projects. Within these projects you can create multiple deployments, which are basically your containerized code. You can also chain together deployments to create a pipeline.

Create a project

Head over to the UbiOps WebApp and click on “Create new project”. You can give your project a unique name, or let UbiOps generate one for you.

Now that we have our project up and running we can start building the environment that our code will run in.

Create a code environment

See our documentation page for more information on environments.

For this guide we’ll be building our environment explicitly. You can do this by going to the “Environments” tab on the left hand side, and clicking on “+ Custom environment”. Then, fill in the following parameters:

Name	gemma-7b-environment
Base environment	Ubuntu 22.04 + Python 3.10 + CUDA 11.7.1
Custom dependencies	Upload the environment package (link above)

Then click on the “Create” button below and UbiOps will start building your environment (this should only take a few minutes).

Retrieve your Gemma 7B token

To be able to download Gemma 7B from Hugging face, you will need a Hugging Face api token showing you have accepted Google’s license agreement. Firstly, accept the following license agreement on HuggingFace by pressing the button ‘Acknowledge license’ on the Gemma 7B Hugging Face page.

Follow the instructions and accept the license agreement if you agree to the terms.

Secondly, go to Settings->Access Tokens->New Token and generate a new token with the “read”permission. Copy the token to your clipboard.

Lastly, you will need to add this token to the `deployment.py` file which is inside the `gemma-7b-deployment.zip` file you downloaded at the start of the guide. Unzip it and open `deployment.py`.

Insert your token within the quotations, save, and re-zip the folder.

Modify the system prompt

You can set a system prompt which will give the LLM some specifics concerning the type or tone of all the prompts. You can edit this inside the `deployment.py` file. In our case, we have it set to give each prompt a pirate tone. You can edit this to anything you want.

Create your Gemma 7B deployment

Now you can navigate to the “Deployments” tab on the left and click on “Create”. In the following menu, you can define the name of the deployment as well as its input(s) and output(s). The input and output fields of the deployment define what data the deployment expects when making a request (i.e. when running the model). For this guide you can use the following:

Name	gemma-7b-deployment
Input	Type: Structured, Name: Request, Data Type: String
Output	Type: Structured, Name: Response, Data Type: String

After providing that information, UbiOps will generate Python and R deployment code snippets that can be downloaded or copied. These are used to create the deployment package. For this guide, we will be using Python.

To finish creating your deployment, click “Next: Create a version”.

Once that’s done, your model will be ready for action!

Create a deployment version

Upload the Gemma 7B deployment file, which contains the code that downloads the Gemma 7B model from Hugging Face and runs it on UbiOps.

Choose to “Enable Accelerated Hardware”, this will enable you to select GPU instance types. For Gemma 7B, we recommend either Nvidia’s T4 or L4 GPUs.

Next, for “Select code environment” choose the environment which was created at the beginning of the guide named gemma-7b-environment-gpu.

How to run a Gemma 7B model on UbiOps

Navigate to your Gemma 7B deployment version and click on the “Create Request” button to create your first request. For this model your request could be something like “Tell me about whales”. Here is the response:

How easy was that?

Conclusion

And there we have it!

Our very own Gemma 7B API, hosted and served on UbiOps. All in under 15 minutes, without needing a software engineer. We also showed you how to modify the system prompt, which is a prompt engineering technique modifying Gemma to have any specific tone or characteristic. In our case, Gemma was a pirate. If you want to learn more about LLMs, read our article about fine-tuning or this article about selecting the right LLM for your use case.

Thanks for reading!

By industry

By application

On-demand GPU

Featured customers

NEW! Webinar with ReefSupport!

Latest news

UbiOps vs standard Model Serving Platforms

New UbiOps features July 2024