
Deploy Gemma 2B on UbiOps

This tutorial will help you create a cloud-based inference API endpoint for the gemma-2b-it model using UbiOps. Gemma-2b-it is a lightweight LLM developed by Google that can run on a CPU-type instance, and it is available via Huggingface.

Note that Gemma is a gated model, so you will need a valid Huggingface token with sufficient permissions if you want to download gemma-2b-it from Huggingface. You can apply for access in the repository of the respective model. Alternatively, the model can be uploaded to a UbiOps bucket and downloaded from there.
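
If you go the bucket route, you can fetch the model files with the client library's file helpers. The snippet below is only a minimal sketch: it assumes a bucket named "default" that already contains a gemma-2b-it.zip archive, uses the ubiops.utils.download_file helper (check the UbiOps docs for the exact signature), and reuses the API client that we set up in step 1.

# Minimal sketch: download the model archive from a UbiOps bucket instead of Huggingface.
# Assumes a bucket named "default" containing "gemma-2b-it.zip"; the helper name and
# arguments should be verified against the UbiOps client library documentation.
import ubiops

ubiops.utils.download_file(
    client,                       # the ubiops.ApiClient created in step 1
    project_name=PROJECT_NAME,
    bucket_name="default",
    file_name="gemma-2b-it.zip",
    output_path=".",
)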

In this tutorial we will walk you through:

  1. Connecting with the UbiOps API client
  2. Creating a code environment for our deployment
  3. Creating a deployment for the Gemma 2B model
  4. Calling the Gemma 2B deployment API endpoint
Deployment input & output variables:

  Input fields:   prompt (string), config (dict)
  Output fields:  response (string)

1. Connecting with the UbiOps API client

To use the UbiOps API from our notebook, we need to install the UbiOps Python client library first

!pip install -qU ubiops

To set up a connection with the UbiOps platform API we need the name of your UbiOps project and an API token with project-editor permissions. See our documentation on how to create a token.

Once you have your project name and API token, paste them below in the following cell before running.

import ubiops
import shutil
import os
from datetime import datetime

# Set some variables that we will use later to create our deployment
ENVIRONMENT_NAME = "gemma-env"

DEPLOYMENT_NAME = f"gemma-{datetime.now().date()}"
DEPLOYMENT_VERSION = "v1"

# Define our tokens
API_TOKEN = "<API TOKEN>"  # Make sure this is in the format "Token token-code"
PROJECT_NAME = "<PROJECT_NAME>"  # Fill in your project name here
HF_TOKEN = "<HF_TOKEN>"  # Format: "hf_xyz"


# Initialize client library
configuration = ubiops.Configuration(host="https://api.ubiops.com/v2.1")
configuration.api_key["Authorization"] = API_TOKEN

# Establish a connection
client = ubiops.ApiClient(configuration)
api = ubiops.CoreApi(client)
print(api.service_status())

2. Setting up the environment

First set and create a directory to store environment files

environment_dir = "environment_package"

!mkdir {environment_dir}

Then create the environment files with the Python packages that we will use in our solution

%%writefile {environment_dir}/requirements.txt
huggingface-hub==0.20.3
transformers==4.38.1
torch==2.1.0

Now we create a UbiOps environment. We select a base environment with Ubuntu 22.04 and Python 3.10 pre-installed. Our additional dependencies are installed on top of this base environment to create our new custom environment, with the name that we specified earlier.

api_response = api.environments_create(
    project_name=PROJECT_NAME,
    data=ubiops.EnvironmentCreate(
        name=ENVIRONMENT_NAME, base_environment="ubuntu22-04-python3-10"
    ),
)

Package the environment files, upload them to our custom environment, and wait until the environment has been built before continuing (this should normally take around ten minutes). In the meantime, you can check out the new environment and the build process in the UbiOps WebApp.

env_files = shutil.make_archive(environment_dir, "zip", ".", environment_dir)
api.environment_revisions_file_upload(
    project_name=PROJECT_NAME,
    environment_name=ENVIRONMENT_NAME,
    file=env_files,
)

ubiops.utils.wait_for_environment(
    client=api.api_client,
    project_name=PROJECT_NAME,
    environment_name=ENVIRONMENT_NAME,
    stream_logs=True,
)

3. Creating a deployment for the Gemma 2B model

Now that we have created our code environment in UbiOps, it is time to write the actual code to run the Gemma-2b model and push it to UbiOps.

As you can see, we're uploading a deployment.py file with a Deployment class and two methods:

  - __init__ will run when the deployment starts up and can be used to load models, data, artifacts and other requirements for inference.
  - request() will run every time a call is made to the model REST API endpoint and includes all the logic for processing data.

Separating the logic between the two methods ensures fast model response times. We load the model from Huggingface in the __init__ method, and put the code that needs to run when a call is made to the deployment in the request() method. This way, the model only needs to be loaded when the deployment starts up.

Now set and create a directory to store deployment files

deployment_code_dir = "deployment_package"

!mkdir {deployment_code_dir}

And add the deployment.py to the directory

%%writefile {deployment_code_dir}/deployment.py
import transformers
import os
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, pipeline


class Deployment:

    def __init__(self, base_directory, context):


        # Log in to huggingface
        token = os.environ["HF_TOKEN"]

        login(token=token)

        # Download Gemma from Huggingface
        model_id = os.environ.get("MODEL_ID", "google/gemma-2b-it")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id)


        # You can change the system prompt by adding an environment variable to your deployment (version)
        self.system_prompt = os.environ.get("SYSTEM_PROMPT", "You are a friendly chatbot who always responds in the style of a pirate")

        # Set some default configuration parameters
        self.default_config = {
            "max_length": 256,
            "eos_token_id": self.tokenizer.eos_token_id,
            "temperature": 0.3,
        }


    def request(self, data, context):
        user_prompt = data["prompt"]
        config = self.default_config.copy()

        # Update config dict if user added a config dict
        if data["config"]:
            config.update(data["config"])

        chat = [
            { "role": "user", "content": f"{self.system_prompt} \n {user_prompt}"}
        ]
        print("Applied chat: \n", chat)

        prompt = self.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
        inputs = self.tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

        # Generate a response using the (possibly user-updated) generation config
        generate_ids = self.model.generate(inputs, **config)
        response = self.tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        print(response)

        # Here we set our output parameters in the form of a json
        return {"response": response}

Create the deployment & deployment version

Create the deployment

# Create the deployment
deployment_template = ubiops.DeploymentCreate(
    name=DEPLOYMENT_NAME,
    input_type="structured",
    output_type="structured",
    input_fields=[
        {"name": "prompt", "data_type": "string"},
        {"name": "config", "data_type": "dict"},
    ],
    output_fields=[{"name": "response", "data_type": "string"}],
)

api.deployments_create(project_name=PROJECT_NAME, data=deployment_template)

And a deployment version

# Create the version
version_template = ubiops.DeploymentVersionCreate(
    version=DEPLOYMENT_VERSION,
    environment=ENVIRONMENT_NAME,
    instance_type_group_name="16384 MB + 4 vCPU",
    maximum_instances=1,
    minimum_instances=0,
    maximum_idle_time=600,  # = 10 minutes
    request_retention_mode="full",
)

api.deployment_versions_create(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=version_template
)

Then upload a revision for the deployment version

# And now we zip our code (deployment package) and push it to the version

import shutil

deployment_code_archive = shutil.make_archive(
    deployment_code_dir, "zip", deployment_code_dir
)

upload_response = api.revisions_file_upload(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    file=deployment_code_archive,
)
print(upload_response)

# Check if the deployment is finished building. Because the environment has already been built, this step should only take
# a couple of seconds
ubiops.utils.wait_for_deployment_version(
    client=api.api_client,
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    revision_id=upload_response.revision,
)

Then add an environment variable with your Huggingface token

api_response = api.deployment_version_environment_variables_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    data=ubiops.EnvironmentVariableCreate(name="HF_TOKEN", value=HF_TOKEN, secret=True),
)

If you are not happy with the default SYSTEM_PROMPT that we provided, you can add your own system prompt here

CUSTOM_SYSTEM_PROMPT = "You are a friendly chatbot who always responds in the style of a man with a mission"

api_response = api.deployment_version_environment_variables_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    data=ubiops.EnvironmentVariableCreate(
        name="SYSTEM_PROMPT", value=CUSTOM_SYSTEM_PROMPT, secret=False
    ),
)

Create a request

We can now send our first prompt to our Gemma LLM! On the first spin-up, the model will need to be downloaded from Huggingface, resulting in a cold-start time of a couple of minutes. Subsequent requests should be handled faster. Do note that this model has a rather long inference time in general. You can check the UI for the progress on the request.

data = {
    "prompt": "I accidentally brought the Black Plague on my ship. How do I blame the crew?",
    "config": {},
}

response = api.deployment_version_requests_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    data=data,
    version=DEPLOYMENT_VERSION,
    timeout=3600,
)

print(response.result)
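
The config field can also be used to override the default generation parameters that we set in the __init__ method, for example:

# Send a request that overrides some of the default generation parameters
data = {
    "prompt": "Write a two-sentence summary of what UbiOps does.",
    "config": {"max_length": 512, "temperature": 0.7},
}

response = api.deployment_version_requests_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    data=data,
    version=DEPLOYMENT_VERSION,
    timeout=3600,
)
print(response.result)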

So that's it! You now have your own on-demand, scalable Gemma 2B Instruct model running in the cloud, with a REST API that you can reach from anywhere!
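
For example, outside of this notebook you could call the endpoint with a plain HTTP request. The snippet below is a minimal sketch that assumes the v2.1 request endpoint pattern used by the client library; check the UbiOps API reference for the exact URL and response format.

# Minimal sketch: call the deployment endpoint over plain HTTP
# (assumes the v2.1 request endpoint pattern; verify against the UbiOps API reference)
import requests

url = (
    f"https://api.ubiops.com/v2.1/projects/{PROJECT_NAME}"
    f"/deployments/{DEPLOYMENT_NAME}/versions/{DEPLOYMENT_VERSION}/requests"
)
headers = {"Authorization": API_TOKEN}  # "Token token-code" format

payload = {"prompt": "What is UbiOps?", "config": {}}

resp = requests.post(url, json=payload, headers=headers, timeout=3600)
resp.raise_for_status()
print(resp.json())  # the "result" field should contain the deployment output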