Training ML models on UbiOps

Training Machine Learning models in the cloud from scratch can be a challenging task. In this post we will dive into how UbiOps is not only useful for running and scaling model inference, but can also be used to run training jobs for Machine Learning models.

UbiOps has built-in functionality for managing and running model training jobs. Training jobs on UbiOps can run non-stop for multiple days, and you can choose between CPU and GPU nodes and select different node sizes to fit your workload. This makes UbiOps a great place to offload training scripts that take too long, or require too many resources, for your local machine. UbiOps provides compute resources on-demand, so you only pay for the time your job is running.

In this blog post we will give you an overview of how to create a training job and run it in the cloud using UbiOps. The same structure can be used for pretty much any framework, such as TensorFlow, Keras, PyTorch, scikit-learn and others.

If you are not familiar with the UbiOps platform: UbiOps helps you turn your Python code into a microservice with its own API, and runs and scales this service in the cloud. You can use it to run data science models as services, use them in the background of a website or app, and let them scale based on the number of calls. UbiOps can also be used to kick off training jobs on-demand in the cloud, with the fast iteration cycle that a developer wants. It is easy to interact with the file system, for example to store intermediate checkpoints of models.

To read more about the platform, go to https://ubiops.com/product/. 

UbiOps takes care of containerizing your code, assigning it the right resources and scaling the number of instances as needed. It also takes care of API management, authentication and workflow orchestration. You only have to think about the code you want to run.

Defining the training job

The layout of a training job in UbiOps works with some fixed and some flexible input and output fields. The training job itself is defined by your training code, which takes a dataset as input. We can also pass parameters to the training job, such as the number of training epochs, the batch size and any other hyperparameters we want to try out. As output we want to return the trained model file, the final loss and accuracy metrics, and any other custom metrics. This way we can measure performance and compare the results of this training run to those of other runs. We can also return additional output files that provide extra information about the performance of the final model, such as a confusion matrix.

A training job in UbiOps is called a run. To run the training job on UbiOps, we create a file named `train.py` and include our code there. This code executes as a single ‘Run’ inside an ‘Experiment’. An `Experiment` can contain multiple training runs. The training runs inside an `Experiment` run on top of an `Environment`, which contains an instance type (hardware) and the code dependencies.

We made a Jupyter Notebook that holds all the necessary training code for a TensorFlow example, together with the commands to deploy it to a UbiOps account. If you want to see the full example and run it yourself, you can use the following Google Colab notebook:

Training a Tensorflow model on UbiOps – Google Colab

Preparing your training code for UbiOps

Let’s first take a look at the training script. The UbiOps `train.py` structure is quite simple. It only needs to contain a `train()` function, with input parameters `training_data` (a file path to your training data) and `parameters` (a dictionary that contains the parameters of your choice).
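In its most basic form the script therefore looks something like the sketch below; the hyperparameter name is just a placeholder, and the returned artifact and metrics mirror the full example later in this post:

def train(training_data, parameters, context = {}):
    '''Minimal skeleton of a UbiOps training script.'''

    # `training_data` is a local file path to the dataset passed to the run
    # `parameters` is a dictionary with the hyperparameters of your choice
    nr_epochs = int(parameters.get("nr_epochs", 1))  # placeholder hyperparameter

    # ... load the data, build the model and train it here ...

    # Return a file as artifact (it must exist on disk at this point)
    # plus any metrics you want to log for this run
    return {
        "artifact": "model.pkl",
        "metrics": {"fin_loss": 0.0, "fin_acc": 0.0}  # replace with real values
    }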

When training a model, we need to load our training data set. There are two ways to get the training data as input to the training run when we start the run.

  • The first option is to send it as a file via the UbiOps File Storage system. Using this approach, we can import large data files directly into our training job: the data file is passed directly as the `training_data` input of the run. This is what we have implemented in our example code. Note that you can also connect an existing bucket on AWS, Azure or GCP to UbiOps; check our docs to see how you can set this up.
  • The second option is to add a reference to the dataset as an input parameter. This can be useful if you use an online dataset or if you have stored your dataset in an external location, such as Snowflake. Tip: you can use authentication secrets securely in your training code through environment variables, as shown in the sketch after this list.
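As an illustration of the second option, the sketch below reads a dataset URL from the run parameters and an access token from an environment variable. The parameter name `dataset_url` and the variable name `DATA_API_TOKEN` are made up for this example; use whatever names fit your setup.

import os
import urllib.request

def train(training_data, parameters, context = {}):
    # Hypothetical parameter that holds a reference to an external dataset
    dataset_url = parameters["dataset_url"]

    # Secrets are injected as environment variables on the experiment,
    # so they never have to appear in the training code itself
    token = os.environ["DATA_API_TOKEN"]

    request = urllib.request.Request(
        dataset_url,
        headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response, open("dataset.tar.gz", "wb") as f:
        f.write(response.read())

    # ... extract the archive and continue training as usual ...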

The same holds for storing output data: you can either retrieve it through the UbiOps File Storage system, or write it directly from your training code to an external storage system using a client.
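For example, if your team keeps model artifacts in an S3 bucket, a few lines of boto3 at the end of `train()` would do. The bucket name and key below are placeholders, boto3 would need to be added to your environment’s `requirements.txt`, and the AWS credentials are assumed to be available, for instance as environment variables on the experiment:

import boto3

def upload_artifact(local_path="model.pkl"):
    # Credentials are picked up from the environment, e.g. AWS_ACCESS_KEY_ID and
    # AWS_SECRET_ACCESS_KEY set as environment variables on the experiment
    s3 = boto3.client("s3")
    s3.upload_file(local_path, "my-model-artifacts", "flowers/model.pkl")  # placeholder bucket and key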

Our final training code looks as follows!

%%writefile train.py

import os
import tensorflow as tf
import joblib
import pathlib
import shutil
import tarfile

def train(training_data, parameters, context = {}):

    '''All code inside this function will run when the training job is initiated.'''

    img_height = 180
    img_width = 180
    batch_size = int(parameters['batch_size']) #Specify the batch size
    nr_epochs = int(parameters['nr_epochs']) #Specify the number of epochs
 

    # Load the training data

    extract_dir = "flower_photos"
    with tarfile.open(training_data, 'r:gz') as tar:
        tar.extractall("./")

    data_dir = pathlib.Path(extract_dir)
    train_ds = tf.keras.utils.image_dataset_from_directory(
        data_dir,
        validation_split=0.2,
        subset="training",
        seed=123,
        image_size=(img_height, img_width),
        batch_size=batch_size)

    val_ds = tf.keras.utils.image_dataset_from_directory(
        data_dir,
        validation_split=0.2,
        subset="validation",
        seed=123,
        image_size=(img_height, img_width),
        batch_size=batch_size
        )

    class_names = train_ds.class_names
    print(class_names)

    # Standardize the data
    # (shown for illustration; the model below also rescales its inputs
    # with its own Rescaling layer, so normalized_ds is not used further)

    normalization_layer = tf.keras.layers.Rescaling(1./255)
    normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
    image_batch, labels_batch = next(iter(normalized_ds))

    # Configure the dataset for performance

    AUTOTUNE = tf.data.AUTOTUNE
    train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
    val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

    # Train the model

    num_classes = 5

    # A simple convolutional classifier, following the standard
    # TensorFlow image-classification tutorial layout
    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1./255),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(num_classes)
    ])

    model.compile(
        optimizer='adam',
        loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=nr_epochs    
    )

    eval_res = model.evaluate(val_ds)
   

    # Return the trained model file and metrics
    joblib.dump(model, 'model.pkl')
    fin_loss = eval_res[0]
    fin_acc = eval_res[1]

   
    print(history)
    print(history.history)

    return {
        "artifact": 'model.pkl',
        "metrics": {
            'fin_loss': fin_loss,
            'fin_acc': fin_acc,
            "loss_history": history.history["loss"],
            "acc_history": history.history["accuracy"]
        },
    }

If we supply this training code, along with a zipped `training_data` file and values for our input parameters, a training run is initiated! Each training run can either reuse the same code with different parameters, or contain a different version of the `train.py` file, and multiple training runs can run in parallel. But let us first discuss `Environments` and `Experiments` in more detail.

Creating the environment and experiment

The basis for model training in UbiOps is an ‘Experiment’. An experiment has a fixed code environment and hardware (instance) definition, but it can hold many different ‘Runs’. You can create an experiment in the WebApp or with the Python client, as we do in the notebook.

For this example, we create an environment named ‘python3-10-tensorflow-training’, select Python 3.10 as the base and upload a `requirements.txt` that contains the relevant dependencies. Then we create an ‘Experiment’ that references this environment and an instance type, and add any environment variables we need. Here we use a CPU instance with 4GB memory.
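In the notebook we do this with the UbiOps Python client. The sketch below gives a rough idea of what that looks like; treat the exact method and field names as assumptions that may differ between client versions, and refer to the linked notebook and the UbiOps docs for the exact calls we used.

import ubiops

API_TOKEN = "Token <YOUR_API_TOKEN>"   # placeholder
PROJECT_NAME = "<YOUR_PROJECT>"        # placeholder
ENVIRONMENT_NAME = "python3-10-tensorflow-training"
EXPERIMENT_NAME = "tensorflow-training-example"   # example name

client = ubiops.ApiClient(ubiops.Configuration(
    api_key={"Authorization": API_TOKEN},
    host="https://api.ubiops.com/v2.1"
))
core_api = ubiops.CoreApi(client)

# Create the environment: Python 3.10 as base, plus our requirements.txt
core_api.environments_create(
    project_name=PROJECT_NAME,
    data=ubiops.EnvironmentCreate(
        name=ENVIRONMENT_NAME,
        base_environment="python3-10"
    )
)
core_api.environment_revisions_file_upload(
    project_name=PROJECT_NAME,
    environment_name=ENVIRONMENT_NAME,
    file="requirements.txt"
)

# Create the experiment: our environment plus a CPU instance with 4GB memory
training_instance = ubiops.Training(client)
training_instance.experiments_create(
    project_name=PROJECT_NAME,
    data=ubiops.ExperimentCreate(
        name=EXPERIMENT_NAME,
        environment=ENVIRONMENT_NAME,
        instance_type="4096mb",        # 4GB CPU node
        default_bucket="default"
    )
)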

Starting the training run and monitoring progress

Now we have all components that are required to start our training runs. We can either do this using the UbiOps WebApp (UI) or with the help of the UbiOps Python client. We can initiate a training run with a single command!

new_run = training_instance.experiment_runs_create(
    project_name=PROJECT_NAME,
    experiment_name=EXPERIMENT_NAME,
    data=ubiops.ExperimentRunCreate(
        name=RUN_NAME,
        description='Trying out a first run',
        training_code=RUN_SCRIPT,
        training_data=training_data,
        parameters={
            "nr_epochs": 2,  # example parameters
            "batch_size": 32
        },
        timeout=14400
    )
)
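Once the run is submitted, you can poll its status and results from the same client. A minimal sketch, assuming the `experiment_runs_get` method of the training client (check the client docs for your version):

run_details = training_instance.experiment_runs_get(
    project_name=PROJECT_NAME,
    experiment_name=EXPERIMENT_NAME,
    run_id=new_run.id
)
print(run_details.status)   # e.g. pending, processing, completed or failed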

We can easily fine-tune our training code, submit a new run, and analyze the logs along the way.

When training a model it is important to keep track of the training progress and convergence. We do this by looking at the training loss and accuracy metrics. Packages like TensorFlow print these continuously, and we can track them on the logging page of the UbiOps UI.
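If you want the metrics to show up as concise, easy-to-search lines in those logs, you could add a small Keras callback to `model.fit()`. A sketch that works with the metrics used in our example (loss and accuracy, with a validation set):

import tensorflow as tf

class LogMetrics(tf.keras.callbacks.Callback):
    """Print one concise line per epoch, so it is easy to find in the UbiOps logs."""

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        print(f"epoch={epoch + 1} "
              f"loss={logs.get('loss'):.4f} "
              f"accuracy={logs.get('accuracy'):.4f} "
              f"val_loss={logs.get('val_loss'):.4f} "
              f"val_accuracy={logs.get('val_accuracy'):.4f}")

# Usage inside train.py:
# history = model.fit(train_ds, validation_data=val_ds, epochs=nr_epochs,
#                     callbacks=[LogMetrics()])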

If you notice a training job is not converging, you can cancel the job and try again with different data or different parameters. For jobs that take a long time to run, you can set up email alerts to notify you when the job is finished.

Evaluating the output

When a training run is completed, it provides you with the trained model file and the final accuracy and loss. The model file is stored inside a UbiOps bucket, and you can easily navigate to this location from the training-run interface.
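You can also download the artifact programmatically. A minimal sketch, assuming the `download_file` helper in `ubiops.utils` and using placeholder bucket and file names (the actual location is shown on the run’s detail page; check the client docs for the exact signature):

import ubiops.utils

# Download the stored model artifact to the current working directory
ubiops.utils.download_file(
    client,
    project_name=PROJECT_NAME,
    bucket_name="default",     # placeholder bucket name
    file_name="model.pkl",     # placeholder file name
    output_path="."
)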

You can easily compare the performance metrics of different training runs on the Evaluation page of the Training tab in the UI, allowing you to analyze which code or which hyperparameters worked best. And that’s it: you just trained a couple of TensorFlow models on UbiOps and can use the best one downstream in your MLOps pipeline.

What will you build next?

This is of course only an example of what you can do with UbiOps. You can extend the code in any way you like, configure it to use different input/output variables or use code that’s fully yours.

Apart from supplying a single training script to your experiment, you can also supply a whole training directory. In it you can include your own Python code files and libraries, artifacts and much more to run the job that you want, from a simple example training job like the one in this post to your own custom libraries shared with your team members. A small sketch of packaging such a directory follows below.
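As an illustration, you could keep `train.py` and your own helper modules together in a folder, zip it, and pass the archive as the training code when creating the run. The folder layout and names below are only an example:

import shutil

# training_package/
# ├── train.py        # contains the train() function
# └── my_lib/         # your own helper package
#     └── preprocessing.py

# Create training_package.zip from the folder
RUN_SCRIPT = shutil.make_archive("training_package", "zip", "training_package")

# RUN_SCRIPT can then be passed as `training_code` when creating the run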

If you want to learn more, visit our documentation pages and the How-to & Tutorials section for more inspiration.
