
Download files from Huggingface and upload them to UbiOps

When working with larger models, you might want to upload your model to a UbiOps storage bucket. This prevents you from having to download the entire model and other necessary files (like the tokenizer) from Huggingface into your deployment every time it spins up, which means you can still use your model when Huggingface is down. Storing your model on UbiOps also enables you to run your model in an air-gapped environment, and allows you to keep the entire inference architecture inside UbiOps. Lastly, it gives you more control over the version of the model that you are using.

This how-to shows you how to create a Python script that downloads models from Huggingface and stores them in a UbiOps bucket. Furthermore, we will also show you how to execute the code entirely on UbiOps using the UbiOps Training functionality. You can find download links for the necessary files further below.

Note that in order to follow the steps described in this how-to you'll need to have the training functionality enabled.

For running this process on UbiOps you need to perform the following steps:

  1. Create a custom environment for the training run.
  2. Create a training experiment to set up the training run (the actual code executions).
  3. Create a training run which downloads the files from Huggingface and uploads them to UbiOps.

If you want to use this method, we advise you to create a new bucket inside your project specifically for model weights.
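
If you prefer to create that bucket from code instead of through the WebApp, the snippet below is a minimal sketch using the UbiOps Python client. The bucket name model-weights is just an example, and the exact BucketCreate fields (such as the provider value) are assumptions that may differ per client version, so double-check them against the client reference.

import ubiops

API_TOKEN = "<INSERT YOUR UBIOPS_API_TOKEN>"
PROJECT_NAME = "<INSERT YOUR PROJECT_NAME>"

# Connect to the UbiOps API
configuration = ubiops.Configuration()
configuration.api_key['Authorization'] = API_TOKEN
configuration.host = "https://api.ubiops.com/v2.1"
api_client = ubiops.ApiClient(configuration)
core_api = ubiops.CoreApi(api_client)

# Create a dedicated bucket for model weights (example name and assumed fields)
core_api.buckets_create(
    project_name=PROJECT_NAME,
    data=ubiops.BucketCreate(
        name="model-weights",
        provider="ubiops",  # assumed: UbiOps-hosted storage
        description="Bucket for Huggingface model weights"
    )
)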

Creating the Python script

In order to download the correct files from Huggingface, we need to define some hyperparameters first. After defining the parameters we'll establish a connection to the UbiOps API. Note that the code snippet below is set up for a repository on Huggingface that has gated access, which is why we use the login function from huggingface_hub to log in. If the model that you want to download doesn't require you to provide a Huggingface token, you can skip this step.

After defining the parameters, establishing a connection with UbiOps, and optionally logging in to Huggingface, we use the snapshot_download function from huggingface_hub to download the repository that you defined as the MODEL_ID:

from huggingface_hub import snapshot_download, login
import ubiops
import os

# Define hyperparameters
API_TOKEN = "<INSERT YOUR UBIOPS_API_TOKEN>"
PROJECT_NAME = "<INSERT YOUR PROJECT_NAME>"

# Optional
HF_TOKEN = "<INSERT YOUR HUGGINGFACE_TOKEN>"

MODEL_ID = "<INSERT_THE_MODEL_ID>" # Huggingface model ID to download
BUCKET_NAME = "<INSERT_BUCKET_NAME>" # Bucket to upload the snapshot to
DIR = "<INSERT_YOUR_BUCKET_DIRECTORY>" # Directory in the bucket to upload to

# Connect to the UbiOps API
configuration = ubiops.Configuration()
configuration.api_key['Authorization'] = API_TOKEN
configuration.host = "https://api.ubiops.com/v2.1"
api_client = ubiops.ApiClient(configuration)

# Optionally log in to Huggingface, in case you want to access a gated repo
login(HF_TOKEN)

# Download a snapshot of the model's repository
print(f"Downloading snapshot of {MODEL_ID} from Huggingface")
path = snapshot_download(
    repo_id=MODEL_ID,
    local_dir=DIR
)

Now define which files you want to upload to UbiOps, along with the paths to which you want to upload them:

print(f"Downloaded snapshot of model card {MODEL_ID} from Huggingface to {path}")

file_mappings = [
    {"file_path": os.path.join(os.getcwd(), DIR, file_name), "upload_path": os.path.join(DIR, file_name)} 
    for file_name in os.listdir(DIR) 
    if os.path.isfile(os.path.join(os.getcwd(), DIR, file_name))
]

# Print information about which file is going to be uploaded, and to which project.
print(f"Going to upload files to bucket '{BUCKET_NAME}' in project '{PROJECT_NAME}' \n:")
for file_info in file_mappings:
    print(file_info['file_path'])

The final step is to use the upload_file function from the UbiOps utils module to upload the files to your UbiOps bucket:

# Upload each file to the bucket
for file_mapping in file_mappings:
    file_path = file_mapping["file_path"]
    upload_path = file_mapping["upload_path"]

    print(f"Uploading file:\n{upload_path}")
    print(f"To directory:\n{file_path}")

    # Upload the files
    ubiops.utils.upload_file( 
        client=api_client, 
        project_name=PROJECT_NAME, 
        file_path=file_path, 
        bucket_name=BUCKET_NAME, 
        file_name=upload_path
    )
    print(f"Upload of file:\n{upload_path} was succesful")
print("Finished downloading all files")
You can download the file shown below here.

Click here to see the full script for running this process locally
from huggingface_hub import snapshot_download, login
import ubiops
import os

# Define hyperparameters
API_TOKEN = "<INSERT YOUR UBIOPS_API_TOKEN>"
PROJECT_NAME = "<INSERT YOUR PROJECT_NAME>"

# Optional
HF_TOKEN = "<INSERT YOUR HUGGINGFACE_TOKEN>"

MODEL_ID = "<INSERT_THE_MODEL_ID>" # Huggingface model ID to download
BUCKET_NAME = "<INSERT_BUCKET_NAME>" # Bucket to upload the snapshot to
DIR = "<INSERT_YOUR_BUCKET_DIRECTORY>" # Directory in the bucket to upload to

# Connect to the UbiOps API
configuration = ubiops.Configuration()
configuration.api_key['Authorization'] = API_TOKEN
configuration.host = "https://api.ubiops.com/v2.1"
api_client = ubiops.ApiClient(configuration)

# Optionally log in to Huggingface, in case you want to access a gated repo
login(HF_TOKEN)

# Download a snapshot of the model's repository
print(f"Downloading snapshot of {MODEL_ID} from Huggingface")
path = snapshot_download(
    repo_id=MODEL_ID,
    local_dir=DIR
)

print(f"Downloaded snapshot of model card {MODEL_ID} from Huggingface to {path}")

file_mappings = [
    {"file_path": os.path.join(os.getcwd(), DIR, file_name), "upload_path": os.path.join(DIR, file_name)} 
    for file_name in os.listdir(DIR) 
    if os.path.isfile(os.path.join(os.getcwd(), DIR, file_name))
]

# Print information about which file is going to be uploaded, and to which project.
print(f"Going to upload files to bucket '{BUCKET_NAME}' in project '{PROJECT_NAME}' \n:")
for file_info in file_mappings:
    print(file_info['file_path'])

# Upload each file to the bucket
for file_mapping in file_mappings:
    file_path = file_mapping["file_path"]
    upload_path = file_mapping["upload_path"]

    print(f"Uploading file:\n{upload_path}")
    print(f"To directory:\n{file_path}")

    # Upload the files
    ubiops.utils.upload_file( 
        client=api_client, 
        project_name=PROJECT_NAME, 
        file_path=file_path, 
        bucket_name=BUCKET_NAME, 
        file_name=upload_path
    )
    print(f"Upload of file:\n{upload_path} was succesful")
print("Finished downloading all files")

Now let's have a look at how we can run this process entirely on UbiOps, using the UbiOps Training functionality. In order to run this process on UbiOps we will first need to:

  1. Create a custom environment
  2. Create an experiment
  3. Initiate a training run that executes the (slightly) modified code shown above

Creating the custom environment

Before we can create an experiment, we first need to create an environment that contains the correct dependencies. We can do this by selecting a base environment and extending it with additional dependencies by uploading a requirements.txt.

For this environment you can select Ubuntu 22.04 + Python 3.10 as the base environment. The requirements.txt with the correct dependencies is shown below:

ubiops==4.4.1
huggingface-hub==0.22.2
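
You can also create this environment from code. The snippet below is a sketch under the assumption that the UbiOps client exposes environments_create and environment_revisions_file_upload, and that ubuntu22-04-python3-10 is the identifier of the Ubuntu 22.04 + Python 3.10 base environment; verify these names against your client version.

import ubiops

core_api = ubiops.CoreApi(api_client)  # api_client as created earlier in this how-to
ENVIRONMENT_NAME = "huggingface-to-ubiops"  # example name

# Create the custom environment on top of the base environment
core_api.environments_create(
    project_name=PROJECT_NAME,
    data=ubiops.EnvironmentCreate(
        name=ENVIRONMENT_NAME,
        base_environment="ubuntu22-04-python3-10",  # assumed identifier
        display_name="Huggingface to UbiOps"
    )
)

# Upload the requirements.txt shown above to add the extra dependencies
core_api.environment_revisions_file_upload(
    project_name=PROJECT_NAME,
    environment_name=ENVIRONMENT_NAME,
    file="requirements.txt"
)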

Creating the experiment

Now that we have our environment set up, it's time to create the experiment. The experiment defines the training set-up used for the training runs, which in this case means downloading the files from Huggingface and uploading them to UbiOps. For the experiment we'll need to define the following parameters:

Experiment configuration:
  Name: upload-models
  Description: Experiment for uploading Huggingface models
  Hardware settings: 16384 MB + 4 vCPU
  Environment settings: select the environment you created in the previous step
  Select default output bucket: choose which bucket you'd like the model weights stored in; you can also specify the bucket later as a parameter for the training run
  Environment variables:
    Name: PROJECT_NAME, Value: your project name, Secret: No
    Name: UBIOPS_API_TOKEN, Value: your UbiOps API token, Secret: Yes
    If the model you want to download has restricted access, you'll also need: Name: HF_TOKEN, Value: your Huggingface token, Secret: Yes

Click Create and your experiment will start building!

Click here for an image of how your experiment may look: Empty experiment
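
The experiment can also be created from code with the training part of the UbiOps client. The snippet below is a sketch that assumes a Training class with an experiments_create method and an ExperimentCreate model, and that 16384mb is the instance type matching 16384 MB + 4 vCPU; the environment variables from the table above still need to be attached to the experiment, for example through the WebApp.

import ubiops
from ubiops.training.training import Training

training_instance = Training(api_client)  # api_client as created earlier in this how-to

# Enable the training functionality in your project if you haven't done so yet
# training_instance.initialize(project_name=PROJECT_NAME)

training_instance.experiments_create(
    project_name=PROJECT_NAME,
    data=ubiops.ExperimentCreate(
        name="upload-models",
        description="Experiment for uploading Huggingface models",
        environment="huggingface-to-ubiops",  # the environment created in the previous step
        instance_type="16384mb",              # assumed identifier for 16384 MB + 4 vCPU
        default_bucket="model-weights"
    )
)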

Initiate the training run

As the final step of this how-to we'll initiate a training run. As mentioned before, the training run contains the code that downloads the files from Huggingface and uploads them to the default output bucket you selected when creating the experiment. Note that it's also possible to store the model weights in a different bucket by passing the bucket_name as an input parameter. The code we'll use for the training run is shown below.

For the training run we'll need to specify the following parameters:

Training run configuration:
  Name: upload-<name of the model you want to upload>
  Maximum duration: for most models the default 14400 seconds is sufficient; for larger models a higher timeout is recommended
  Training code: download this file, or copy it from below
  Training data and parameters: leave the training data as is
  Parameters: { "model_id": "model_id", "bucket_name": "model-weights", "dir": "model_name" }

Below you can find an image of an experiment where the train.py (also shown below) was run to download and upload the files for the Llama-3-8B-Instruct model. For this run the following parameters were used:

{
    "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
    "bucket_name": "model-weights",
    "dir": "llama-3b-it"
}
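
You can also start this run from code instead of through the WebApp. The snippet below is a sketch that assumes an experiment_runs_create method and an ExperimentRunCreate model on the training client, and reuses the parameters shown above together with the train.py from below.

import ubiops
from ubiops.training.training import Training

training_instance = Training(api_client)  # api_client as created earlier in this how-to

training_instance.experiment_runs_create(
    project_name=PROJECT_NAME,
    experiment_name="upload-models",
    data=ubiops.ExperimentRunCreate(
        name="upload-llama-3-8b-instruct",
        description="Download Llama-3-8B-Instruct from Huggingface and upload it to UbiOps",
        training_code="train.py",  # the script shown below
        parameters={
            "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
            "bucket_name": "model-weights",
            "dir": "llama-3b-it"
        }
    ),
    timeout=14400
)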
Click here to see an image of the experiment with the completed training run: Experiment after run
Click here to see the `train.py` which is uploaded to UbiOps
from huggingface_hub import snapshot_download, login
import ubiops
import os

def train(training_data, parameters, context):
    # Define hyperparameters
    API_TOKEN = os.environ["UBIOPS_API_TOKEN"]
    PROJECT_NAME = os.environ["PROJECT_NAME"]
    HF_TOKEN = os.environ.get("HF_TOKEN") # Only needed for gated repos

    MODEL_ID = parameters["model_id"] # Huggingface model ID to download
    BUCKET_NAME = parameters["bucket_name"] # Bucket to upload the snapshot to
    DIR = parameters["dir"] # Directory in the bucket to upload to

    # Connect to the UbiOps API
    configuration = ubiops.Configuration()
    configuration.api_key['Authorization'] = API_TOKEN
    configuration.host = "https://api.ubiops.com/v2.1"
    api_client = ubiops.ApiClient(configuration)

    # Optionally log in to Huggingface, in case you want to access a gated repo
    if HF_TOKEN:
        login(HF_TOKEN)

    print(f"Downloading snapshot of {MODEL_ID} from Huggingface")
    path = snapshot_download(
        repo_id=MODEL_ID,
        local_dir=DIR
    )

    # Define files to upload and the path to which to upload them

    print(f"Downloaded snapshot of model card {MODEL_ID} from Huggingface to {path}")
    files = os.listdir(DIR)
    file_mappings = [
        {"file_path": os.path.join(os.getcwd(), DIR, file_name), "upload_path": os.path.join(DIR, file_name)} 
        for file_name in os.listdir(DIR) 
        if os.path.isfile(os.path.join(os.getcwd(), DIR, file_name))
    ]

    # Print information about which file is going to be uploaded, and to which project
    print(f"Going to upload files to bucket '{BUCKET_NAME}' in project '{PROJECT_NAME}' \n:")
    for file_info in file_mappings:
        print(file_info['file_path'])

    # Upload each file to the bucket
    for file_mapping in file_mappings:
        file_path = file_mapping["file_path"]
        upload_path = file_mapping["upload_path"]

        print(f"Uploading file:\n{upload_path}")
        print(f"To directory:\n{file_path}")

        # Upload the files
        ubiops.utils.upload_file( 
            client=api_client, 
            project_name=PROJECT_NAME, 
            file_path=file_path, 
            bucket_name=BUCKET_NAME, 
            file_name=upload_path
        )
        print(f"Upload of file:\n{upload_path} was succesful")

    print("Finished downloading all files")

Here's an example of your storage bucket after following all the steps described above:

upload-models-storage