How to train a model on UbiOps

UbiOps is mainly built for models that have already been trained and are ready to run in production. But it is also possible to use UbiOps to train a model, by creating a deployment that takes training data as input and returns a trained model file as output.

This page explains the key differences between a normal deployment and a training deployment, and how you can create a training deployment yourself. You can also use this example to create a deployment that retrains an existing model.

Creating a deployment for training

A deployment used for training is very similar to a deployment used for inference. The difference is that instead of inference code, we include the training code in the request() function of the deployment.

As input for a training deployment we want to pass the training data set and any (hyper)parameters that we want to adjust for each training run. As output we want to return the trained model parameter file and metrics about the training run.

In this example we configure a deployment with the input and output variables and data types as shown in the table below:

Deployment input & output variables

                 Variable name    Data type
Input fields     nr_epochs        int
                 training_data    blob (file)
Output fields    model_file       blob (file)
                 loss             double (float)
                 accuracy         double (float)

Specifying training parameters as deployment input has another benefit: you can run simultaneous training runs to test which parameters give the best result.

Loading the training data

Sending training data in a request to a deployment can be done in two ways.

  • The first option is to send it as a file to the deployment through the UbiOps API. This works well for small and medium-sized data sets (up to ~512MB). You can do this by giving your deployment an input field of the blob data type, which is used for handling files.

  • The second option is to load the data from an external (object) storage platform like Amazon S3, Google Cloud Storage or Snowflake directly into your deployment. This can be achieved by including the code for calling the storage API inside the request() function, as sketched below. This allows you to leverage the performance of existing storage systems and work with much larger data files. Tip: you can use authentication secrets securely in your deployment code by storing them as environment variables.

The same holds for storing output data. You can either retrieve this through the UbiOps API or store it directly from your deployment code in an external storage system using a client.
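As an illustration of the second option, below is a minimal sketch of a request() function that loads training data from Amazon S3 with boto3 and stores the resulting model file back. The bucket and object names are hypothetical, and the credentials are read from environment variables that you configure as secrets on the deployment:

import os
import boto3


class Deployment:

    def request(self, data):
        # Hypothetical bucket and object names; the credentials come from
        # environment variables configured as secrets in UbiOps
        s3 = boto3.client(
            's3',
            aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
            aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
        )

        # Download the training data directly from object storage
        s3.download_file('my-training-bucket', 'training_data.zip', 'training_data.zip')

        # ... training code ...

        # Store the trained model file back in object storage
        s3.upload_file('model.pkl', 'my-model-bucket', 'model.pkl')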

Scaling up

If you want to start multiple training runs simultaneously, set the maximum_instances parameter of the deployment version to the number of jobs you want to run in parallel and make multiple requests to the deployment. The requests will then be executed simultaneously, or queued until an instance becomes available. You can see the status of each request in the Requests tab in the Monitoring section.
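As a sketch of what this looks like in code: the snippet below starts three training runs in parallel, each with a different number of epochs. It assumes the deployment defined in the example below and an already uploaded training data blob (blob_id), and requires maximum_instances to be at least 3 for the runs to actually execute simultaneously:

# One request payload per training run, each with different hyperparameters
requests_data = [
    {'nr_epochs': n, 'training_data': blob_id}
    for n in (2, 5, 10)
]

# A single batch call can hold multiple request payloads
response = api.batch_deployment_requests_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    data=requests_data
)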

Example

The example below shows how you can set up a deployment in such a way that it trains a model, and how to automate this process using the UbiOps Client Library and the WebApp.

Note: If you want to follow along with this example, you need to install the UbiOps Python client library first. You can do so with the command pip install ubiops.
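The code in the following steps assumes an authenticated CoreApi instance named api. Below is a minimal sketch of setting one up; the API token and project name are placeholders that you need to replace with your own values:

import ubiops

API_TOKEN = 'Token <YOUR_API_TOKEN>'  # placeholder: an API token with permissions in your project
PROJECT_NAME = '<YOUR_PROJECT_NAME>'  # placeholder: the name of your UbiOps project

# Configure the client and authenticate with the API token
configuration = ubiops.Configuration()
configuration.api_key['Authorization'] = API_TOKEN

client = ubiops.ApiClient(configuration)
api = ubiops.CoreApi(client)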

1) Setting up a deployment for training

For a training deployment we need to include the training code in the request() function. This is the part of the deployment that runs when a call to the deployment API is made. The code cell below shows an example of a deployment.py that trains a TensorFlow model with Keras.

This deployment expects both a file (an archive with training data) and the number of epochs as input, and returns the model parameter file and the metrics accuracy and loss as output.

The code cell below shows a shortened code example of a deployment.py file used for training. In this example some lines are omitted for clarity.

import numpy as np
import os
import tensorflow as tf
import joblib
import pathlib
import ubiops


class Deployment:

    def __init__(self, base_directory, context):
        '''Any code inside this method will run when the deployment container starts up.'''

        # Any print statement will end up in the logs of your UbiOps project
        print('Deployment initializing...')


    def request(self, data):
        '''All code inside this function will run when a call to the deployment is made.'''

        # Read the input variables. The 'data' dictionary holds all input variables passed in the call to the deployment.
        batch_size = 10
        num_classes = 5
        nr_epochs = data['nr_epochs']

        # Load the training data. Here we pass an archive as a file. You could also pass a URL to an object storage location.
        data_dir = tf.keras.utils.get_file(origin='file://'+data['training_data'])

        # Split the data into a training set and a validation set
        train_data = tf.keras.utils.image_dataset_from_directory(data_dir, batch_size=batch_size, ...)
        val_data = tf.keras.utils.image_dataset_from_directory(data_dir, batch_size=batch_size, ...)


        # Define and fit a model. In this example a Keras model.
        model = tf.keras.Sequential([
          # Any model configuration.
        ])

        model.compile(...)
        model.fit(
          train_data,
          validation_data=val_data,
          epochs=nr_epochs
        )

        # Evaluate the trained model
        evaluation_res = model.evaluate(val_data)


        # Save the trained model to a file with joblib and return it together with the loss and accuracy metrics.
        joblib.dump(model, 'model.pkl')
        final_loss = evaluation_res[0]
        final_accuracy = evaluation_res[1]

        # The dictionary below will be returned through the API when the job is finished

        return {'model_file': 'model.pkl', 
                'loss': final_loss, 
                'accuracy': final_accuracy}

The code below can be used to define the deployment with the UbiOps Python Client:

deployment_template = ubiops.DeploymentCreate(
    name=DEPLOYMENT_NAME,
    input_type='structured',
    output_type='structured',
    input_fields=[
        {'name': 'nr_epochs', 
         'data_type': 'int'},
        {'name': 'training_data', 
         'data_type': 'blob'},
    ],
    output_fields=[
        {'name': 'model_file', 
         'data_type': 'blob'},
        {'name': 'loss', 
         'data_type': 'double'},
        {'name': 'accuracy', 
         'data_type': 'double'},
    ]
)

api.deployments_create(project_name=PROJECT_NAME, data=deployment_template)

2) Creating a version and choosing an instance type

For this example an instance type with 4 GB of memory is used. Picking a larger instance type will reduce the training time, but will also consume more credits. The instance type chosen here uses 4 credits per hour.

Running training jobs on GPU

It's also possible to run your deployments on GPU, but only if you have a paid subscription.

The example code below will create a deployment version running on a 4 GB instance type.

version_template = ubiops.DeploymentVersionCreate(
    version=DEPLOYMENT_VERSION,
    language='python3.8',
    instance_type='4096mb',
    maximum_instances=1,
    minimum_instances=0,
    maximum_idle_time=300
)

api.deployment_versions_create(project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=version_template)
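After the version has been created, the deployment code itself still has to be uploaded before UbiOps can build it. A minimal sketch, assuming you have zipped the deployment.py shown above (together with a requirements.txt) into an archive called deployment_package.zip:

api.revisions_file_upload(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    file='deployment_package.zip'  # assumed name of your zipped deployment package
)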

If you are using large libraries in your deployments, like TensorFlow, building could take a couple of minutes. You can check the progress by going to the UbiOps WebApp and clicking the Logging button in the menu bar on the left side. You can also use the UbiOps client library to show the logs directly in your notebook.

3) Starting a training run

After the deployment has finished building, it is able to receive requests to start a training run. In this case the deployment expects the number of epochs nr_epochs and a blob named training_data as input, and returns the model_file blob together with the loss and accuracy metrics as output.

For training a model it is advised to use batch requests, instead of direct requests. This is because training a model can take a lot of time, and batch requests are asynchronous and allowed to run for two full days.

Note: This example uses an instance type that has 4096MB of memory. This will reduce the time of training to around 5 minutes, which is still within the boundaries of a direct request. But since most training use cases will take longer we will show how to create a batch request in this example.

Since we're including a blob (file) as input for the deployment, two calls to the API are needed to start a training run:

  1. The first call will upload the blob (training data archive) to UbiOps.
  2. The second call makes the actual request to UbiOps, using the uploaded blob.

Upload the blob to UbiOps:

response = api.blobs_create(project_name=PROJECT_NAME, file='<PATH_TO_TRAINING_DATA>')
blob_id = response.id
print(blob_id)


# The request payload. The keys must match the deployment's input field names.
data = {
    'nr_epochs': 2,
    'training_data': blob_id
}

After the blob has been uploaded to your UbiOps environment, you can make a batch request to the deployment. This request will start the training run.

# A batch request to the default version
request_response = api.batch_deployment_requests_create(
    project_name=PROJECT_NAME, 
    deployment_name=DEPLOYMENT_NAME, 
    data=[data]
)
print(request_response)

# Get the request id to retrieve the results
request_id = request_response[0].id

Monitoring the training convergence

If you want to follow the status of the training process, you can go to the UbiOps WebApp and click the Logging button in the menu bar on the left.

It's also possible to receive a notification when the request is finished. You can do this by creating a notification group for your training deployment.

4) Retrieving the results

After the request is finished we need to make another call to the API to retrieve the results. Remember that this is only necessary when you have made a batch request.

import time

# Wait for the batch request to finish by polling its status
# ('pending' and 'processing' are the statuses of a request that is
# still waiting or running)
while api.deployment_requests_get(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    request_id=request_id
).status in ('pending', 'processing'):
    time.sleep(10)

# Retrieve the request results
request_results = api.deployment_requests_get(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    request_id=request_id
)

print(f'Request status: {request_results.status}')
print(f'Request output: {request_results.result}')

UbiOps stores the model parameter file as a blob. We can download it using its blob id, which is returned in the model_file output field.

# Download the model parameter file
with api.blobs_get(PROJECT_NAME, request_results.result['model_file']) as response:
    filename = response.getfilename()

    # Write the raw bytes of the blob to a local file
    with open(filename, 'wb') as f:
        f.write(response.read())

print(f'The model file is saved as: {filename}')
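Since the model was saved with joblib inside the deployment, you can load it back into memory in the same way:

import joblib

# Load the trained model from the downloaded file
model = joblib.load(filename)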