
Train a model on UbiOps

UbiOps is mainly built for models that have already been trained and are ready to run in production. But it is also possible to use UbiOps to train a model, by creating a deployment which takes training data as input and returns a trained model file as output.

This page explains the key differences between a normal deployment and a training deployment, and how you can create a training deployment yourself. You can also use this example to create a deployment that retrains an existing model.

Creating a deployment for training

A deployment used for training is very similar to a deployment used for running a model. The difference is that instead of inference code, we include the training code in the request() function of the deployment.

As input for a training deployment we want to pass the training data set and any (hyper)parameters that we want to adjust for each training run. As output we want to return the trained model parameter file and metrics about the training run.

In this example we configure a deployment with the input and output variables and data types as shown in the table below:

Deployment input & output variables

  Input fields:
    nr_epochs       int
    training_data   file

  Output fields:
    model_file      file
    loss            double (float)
    accuracy        double (float)

Specifying training parameters as deployment input has another benefit: you can run simultaneous training runs to test which parameters give the best result.

Loading the training data

Sending training data in a request to a deployment can be done in two ways.

  • The first option is to send it as a file to the deployment through the UbiOps API. You can do this by configuring your deployment to have the file data type as input. Note that you can also use UbiOps buckets that are connected to your own cloud storage for this! See working with files for more information.

  • The second option is to load the data from an external (object) storage platform like Amazon S3, Google Cloud Storage or Snowflake directly into your deployment. This can be achieved by including the code for calling the storage API inside the request() function. This allows you to leverage the performance of existing storage systems and work with much larger data files. Tip: you can use the authentication secrets securely in your deployment code by using environment variables.

The same holds for storing output data. You can either retrieve this through the UbiOps API or store it directly from your deployment code in an external storage system using a client.
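
For the second option, the sketch below shows how credentials could be read inside the deployment. The variable names are hypothetical examples; UbiOps lets you store secrets as environment variables for a deployment, so the credentials never appear in your code. The boto3 call in the comments is an illustration for Amazon S3.

```python
import os

# Read storage credentials from environment variables configured as secrets
# in UbiOps (the variable names here are hypothetical examples).
access_key = os.environ.get('AWS_ACCESS_KEY_ID', '')
secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY', '')

# Inside request(), a storage client would then be created with these
# credentials, for example with boto3 for Amazon S3:
# s3 = boto3.client('s3', aws_access_key_id=access_key,
#                   aws_secret_access_key=secret_key)
# s3.download_file('my-bucket', 'training_data.zip', '/tmp/training_data.zip')
```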

Scaling up

If you want to start multiple training runs simultaneously, set the maximum_instances parameter of the deployment version to a number higher than one, depending on how many jobs you want to run in parallel, and make multiple requests to the deployment. The requests will then be executed simultaneously, or queued until an instance becomes available. You can see the status of each request in the Requests tab in the Monitoring section.
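
As a sketch of such a parallel setup, the request payloads for several simultaneous runs can be built as below. The file URI is a hypothetical placeholder; a real run would use the URI returned by ubiops.utils.upload_file, as shown later on this page.

```python
# One payload per training run; each dictionary matches the deployment's
# input fields (nr_epochs, training_data). Sending them all in one batch
# request starts one run per payload.
epoch_options = [5, 10, 20]
file_uri = 'ubiops-file://default/training_data.zip'  # hypothetical placeholder

payloads = [{'nr_epochs': n, 'training_data': file_uri} for n in epoch_options]

# With maximum_instances set to 3 or more, a call such as
# api.batch_deployment_requests_create(project_name=PROJECT_NAME,
#                                      deployment_name=DEPLOYMENT_NAME,
#                                      data=payloads)
# executes the three runs in parallel.
```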

Example

The example below shows how you can set up a deployment so that it trains a model, and how to automate this process using the Client Library and the WebApp.

Note: If you want to follow along with this example, you need to install the client library first. You can do so with the command pip install ubiops.

1) Setting up a deployment for training

For a training deployment we need to include the training code in the request() function. This is the part of the deployment that runs when a call to the deployment API is made. The code cell below shows an example of a deployment.py that trains a TensorFlow model with Keras.

This deployment expects both a file (an archive with training data) and the number of epochs as input, and returns the model parameter file and the metrics accuracy and loss as output.

The code cell below shows a shortened code example of a deployment.py file used for training. In this example some lines are omitted for brevity.

import numpy as np
import os
import tensorflow as tf
import joblib
import pathlib


class Deployment:

    def __init__(self, base_directory, context):
        '''Any code inside this method will run when the deployment container starts up.'''

        # Any print statement will end up in the logs of your UbiOps project
        print('Deployment initializing...')


    def request(self, data):
        '''All code inside this function will run when a call to the deployment is made.'''

        # Read the input variables. The 'data' dictionary holds all input variables passed in the call to the deployment.
        batch_size = 10
        num_classes = 5
        nr_epochs = data['nr_epochs']

        # Load the training data. Here we pass an archive as a file. You could also pass a URL to an object storage location.
        data_dir = tf.keras.utils.get_file(origin='file://'+data['training_data'])

        # Split the data in a training and validation set
        train_data = tf.keras.utils.image_dataset_from_directory(data_dir, batch_size, ...)
        val_data = tf.keras.utils.image_dataset_from_directory(data_dir, batch_size, ...)


        # Define and fit a model. In this example a Keras model.
        model = tf.keras.Sequential([
          # Any model configuration.
        ])

        model.compile(...)
        model.fit(
          train_data,
          validation_data=val_data,
          epochs=nr_epochs
        )

        # Evaluate the trained model
        evaluation_res = model.evaluate(val_data)


        # Return the trained model file as a binary (Pickle) and return the loss and accuracy metrics.
        joblib.dump(model, 'model.pkl')
        final_loss = evaluation_res[0]
        final_accuracy = evaluation_res[1]

        # The dictionary below will be returned through the API when the job is finished

        return {'model_file': 'model.pkl', 
                'loss': final_loss, 
                'accuracy': final_accuracy}

The code below can be used to define the deployment with the UbiOps Python Client:

import ubiops

PROJECT_NAME = 'YOUR_PROJECT_NAME'
DEPLOYMENT_NAME = 'YOUR_DEPLOYMENT_NAME'

configuration = ubiops.Configuration()
configuration.api_key['Authorization'] = 'Token <YOUR_API_TOKEN>'
api = ubiops.CoreApi(ubiops.ApiClient(configuration))

deployment_template = ubiops.DeploymentCreate(
    name=DEPLOYMENT_NAME,
    input_type='structured',
    output_type='structured',
    input_fields=[
        {'name': 'nr_epochs', 
         'data_type': 'int'},
        {'name': 'training_data', 
         'data_type': 'file'},
    ],
    output_fields=[
        {'name': 'model_file', 
         'data_type': 'file'},
        {'name': 'loss', 
         'data_type': 'double'},
        {'name': 'accuracy', 
         'data_type': 'double'},
    ]
)

api.deployments_create(project_name='YOUR_PROJECT_NAME', data=deployment_template)

2) Creating a version and choosing an instance type

For this example an instance type with 4 GB of memory is used. Picking a larger instance type will reduce the training time, but will also consume more credits. The instance type we choose here uses 4 credits per hour.
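
As a rough back-of-the-envelope check, using the numbers from this example (your instance pricing may differ):

```python
# Back-of-the-envelope cost of one training run on the 4 GB instance type,
# using the numbers from this example (4 credits/hour, ~5 minutes per run).
credits_per_hour = 4
run_minutes = 5

cost = credits_per_hour * run_minutes / 60
print(round(cost, 2))  # ~0.33 credits per run
```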

Running training jobs on GPU

It's also possible to run your deployments on GPU, but only if you have a paid subscription.

The example code below will create a deployment version running on a 4 GB instance type.

version_template = ubiops.DeploymentVersionCreate(
    version='YOUR_DEPLOYMENT_VERSION',
    language='python3.8',
    instance_type='4096mb',
    maximum_instances=1,
    minimum_instances=0,
    maximum_idle_time=300
)

api.deployment_versions_create(project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=version_template)

If you are using large libraries in your deployment, like TensorFlow, building can take a couple of minutes. You can check the progress in the UbiOps WebApp by clicking the Logging button in the menu bar on the left side. You can also use the UbiOps client library to show the logs directly in your notebook.

3) Starting a training run

After the deployment has finished building, it is able to receive requests to start a training run. In this case the deployment expects the number of epochs nr_epochs and a file training_data as input, and returns the trained model file model_file and the metrics loss and accuracy as output.

For training a model it is advised to use batch requests, instead of direct requests. This is because training a model can take a lot of time, and batch requests are asynchronous and allowed to run for two full days.

Note: This example uses an instance type with 4096 MB of memory, which keeps the training time at around 5 minutes, still within the boundaries of a direct request. But since most training use cases take longer, we show how to create a batch request in this example.

Since we're including a file as input for the deployment, two calls to the API are needed to start a training run:

  1. The first call will upload the file (training data archive) to UbiOps.
  2. The second call makes the actual request to UbiOps, using the uploaded file.

Upload the file to UbiOps:

bucket_name = 'bucket_name_example'
file_input = 'training_data.zip'

# Upload a file
file_uri = ubiops.utils.upload_file(api, PROJECT_NAME, file_input, bucket_name)

# Configure the request payload. The keys match the deployment's input fields.
data = {
    'nr_epochs': 2,
    'training_data': file_uri
}

After the file has been uploaded to your UbiOps environment, you can make a batch request to the deployment. This request will start the training run.

# A batch request to the default version
request_response = api.batch_deployment_requests_create(
    project_name=PROJECT_NAME, 
    deployment_name=DEPLOYMENT_NAME, 
    data=[data]
)
print(request_response)

# Get the request id to retrieve the results
request_id = request_response[0].id
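
Because a batch request runs asynchronously, you typically poll its status until it reaches a final state. A small sketch of such a helper is shown below; wait_until_done is hypothetical, not part of the UbiOps client, and you would pass it a callable such as lambda: api.deployment_requests_get(project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, request_id=request_id).status.

```python
import time

def wait_until_done(get_status, interval=10, timeout=7200):
    """Poll until the request reaches a final state.

    `get_status` is a zero-argument callable returning the request status
    string. This helper is a hypothetical sketch, not part of the UbiOps client.
    """
    waited = 0
    while waited <= timeout:
        status = get_status()
        if status in ('completed', 'failed'):
            return status
        time.sleep(interval)
        waited += interval
    raise TimeoutError('training request did not finish in time')
```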

Monitoring the training convergence

If you want to follow the status of the training process, you can go to the UbiOps WebApp and click the Logging button in the menu bar on the left.

It's also possible to receive a notification when the request is finished. You can do this by creating a notification group for your training deployment.

4) Retrieving the results

After the request is finished we need to make another call to the API to retrieve the results. Remember that this is only necessary when you have made a batch request.

# Retrieve the request results
request_results = api.deployment_requests_get(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    request_id=request_id
)

print(f'Request status: {request_results.status}')
print(f'Request output: {request_results.result}')

UbiOps stores the model parameter file inside the default storage bucket. We can download it using the file URI.

import os

# Download the model parameter file 
file_uri = request_results.result['model_file']
ubiops.utils.download_file(
    api,
    PROJECT_NAME,
    file_uri=file_uri,
    output_path='.',
    stream=True,
    chunk_size=8192
)
filename = os.path.basename(file_uri)

print(f'The model file is saved as: {filename}')
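
Once downloaded, the model file can be loaded back into Python for inspection or further use. Below is a minimal sketch of the save/load pattern using only the standard library; the deployment above stores the Keras model with joblib.dump, so the real model.pkl should be opened with joblib.load(filename) instead.

```python
import os
import pickle
import tempfile

# Minimal save/load round trip with the stdlib pickle module, illustrating
# the pattern. A dictionary stands in for the trained model here.
model_like = {'weights': [0.1, 0.2, 0.3]}

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model_like, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model_like)  # True
```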