
Training Checkpointing

This how-to explains how to add model checkpointing to your UbiOps training script.
It provides code snippets for both TensorFlow and PyTorch, as well as framework-independent snippets.

Training functionality

Prior familiarity with the training functionality in UbiOps is recommended before proceeding with this how-to.

Checkpointing

To checkpoint a model during training, the model must first be saved and then uploaded to UbiOps. Metric logging can be added as well to record more detailed information, but this how-to focuses only on saving and uploading the model.

TensorFlow checkpointing tutorial

For a comprehensive end-to-end tutorial that covers the entire checkpointing process (including logging) with the TensorFlow framework, refer to the TensorFlow checkpointing tutorial.

Checkpoint timing

While a model is training, an intermediate version of it can be saved at different points in the training process. Many approaches are possible, but this how-to focuses only on saving after each epoch.

TensorFlow

To save the model after each epoch, pass a custom callback to the callbacks argument of the model.fit function.

import tensorflow as tf

class UbiOpsCallback(tf.keras.callbacks.Callback):
    def __init__(self, bucket_name, context):
        super().__init__()
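        # Keep the arguments so that on_epoch_end can use them for the upload
        self.bucket_name = bucket_name
        self.context = context
        # TODO: Add further initialization code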

    def on_epoch_end(self, epoch, logs=None):
        """
        This function is called at the end of each epoch.

        :param epoch: the zero-based index of the epoch that just ended
        :param logs: dict with the metric results of this epoch
        """

        # TODO: Save model and upload to UbiOps

def train(training_data, parameters, context):
    """
    This function is called by UbiOps.

    :param training_data: the training data
    :param parameters: the parameters
    :param context: the context
    """

    # TODO: Add TensorFlow training code (define model and bucket_name)
    model.fit(
        training_data,
        callbacks=[UbiOpsCallback(bucket_name, context)]
    )

PyTorch

To save the model after each epoch, call the torch.save function at the end of each iteration of the training loop.

import torch

def train(training_data, parameters, context):
    """
    This function is called by UbiOps.

    :param training_data: the training data
    :param parameters: the parameters
    :param context: the context
    """

    # TODO: Add PyTorch setup code (define model, epochs and saved_model_name)

    for epoch in range(epochs):
        # TODO: Add training code

        torch.save(model, saved_model_name)

        # TODO: Upload model to UbiOps
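
The same save-then-upload pattern can also be written without any framework. The sketch below is only an illustration: `train_with_checkpoints`, `train_one_epoch` and `upload` are hypothetical names that do not come from UbiOps, and `pickle` stands in for whichever framework-specific save function you actually use.

```python
import pickle
from typing import Any, Callable


def train_with_checkpoints(
    model: Any,
    epochs: int,
    train_one_epoch: Callable[[Any], None],
    upload: Callable[[str, int], None],
    saved_model_name: str = "model.pkl",
) -> None:
    """Train for `epochs` epochs, saving and uploading the model after each one.

    `train_one_epoch` and `upload` are hypothetical callables supplied by the
    caller; they are placeholders for your own training and upload code.
    """
    for epoch in range(epochs):
        # Run one epoch of training; the model is updated in place.
        train_one_epoch(model)

        # Save the intermediate model to a local file.
        # pickle is a stand-in for a framework-specific save call.
        with open(saved_model_name, "wb") as model_file:
            pickle.dump(model, model_file)

        # Hand the saved file and the epoch index to the upload step.
        upload(saved_model_name, epoch)
```

In a UbiOps training script, `upload` could wrap the `ubiops.utils.upload_file` call shown in the Model Saving section. Passing the upload step in as a callable keeps the loop itself independent of both the framework and the storage backend.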

Model saving

How the model is saved depends on the framework used.
To upload the saved model to UbiOps, alongside the training result, the following code can be used:

import ubiops

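# client is assumed to be an authenticated ubiops.ApiClient; project_name,
# bucket_name, saved_model_name, epoch and context come from the training script.
ubiops.utils.upload_file(
    client=client,
    project_name=project_name,
    file_path=saved_model_name,
    bucket_name=bucket_name,
    file_name=f"deployment_requests/{context['id']}/checkpoints/"
              f"{saved_model_name}_epoch_{epoch}.{saved_model_name.split('.')[-1]}",
)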
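
To make the resulting file name easier to inspect, the naming logic from the snippet above can be pulled into a small helper. This is a minimal sketch: `checkpoint_file_name` is a hypothetical helper, not part of the UbiOps client library, and it only reproduces the f-string used in the upload call.

```python
def checkpoint_file_name(request_id: str, saved_model_name: str, epoch: int) -> str:
    """Build the bucket path for the checkpoint of one epoch.

    Hypothetical helper that mirrors the naming in the upload snippet:
    deployment_requests/<request id>/checkpoints/<name>_epoch_<epoch>.<extension>
    """
    # The extension is whatever follows the last dot in the local file name.
    extension = saved_model_name.split(".")[-1]
    return (
        f"deployment_requests/{request_id}/checkpoints/"
        f"{saved_model_name}_epoch_{epoch}.{extension}"
    )


# Example with a made-up request id and file name.
print(checkpoint_file_name("abc123", "model.pt", 3))
# prints "deployment_requests/abc123/checkpoints/model.pt_epoch_3.pt"
```

Because the epoch number is part of the name, each epoch is stored as a separate file instead of overwriting the previous checkpoint. Note that the full local file name, including its extension, appears before the `_epoch_` suffix.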

TensorFlow

TensorFlow models can be saved with the joblib library.

import joblib

joblib.dump(model, saved_model_name)

PyTorch

PyTorch models can be saved with the torch.save function.

import torch

torch.save(model, saved_model_name)