Training Checkpointing¶
This how-to provides instructions on incorporating model checkpointing into your UbiOps training script.
Both TensorFlow and PyTorch code snippets are provided, along with general framework-independent snippets.
Training functionality
Prior familiarity with the training functionality in UbiOps is recommended before proceeding with this how-to.
Checkpointing¶
To checkpoint a model during training, the model must first be saved to a file and then uploaded to UbiOps. Metric logging can be added for more detailed insight into the training run, but this how-to focuses only on saving and uploading the model.
TensorFlow checkpointing tutorial
For a comprehensive end-to-end tutorial covering the entire process of checkpointing a model with the TensorFlow framework (including logging), refer to the TensorFlow checkpointing tutorial.
Checkpoint timing¶
When training a model, an intermediate model can be saved at different points in the training process. Many approaches are possible, but this how-to focuses only on saving after each epoch.
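The per-epoch schedule used in this how-to is one of several options. As a minimal, framework-independent sketch, the hypothetical helper `should_checkpoint` below (not part of UbiOps, TensorFlow, or PyTorch) decides whether a checkpoint is due. The `every_n_epochs` parameter is an assumption added for illustration; its default of 1 corresponds to saving after each epoch.

```python
def should_checkpoint(epoch, total_epochs, every_n_epochs=1):
    """Return True when a checkpoint should be written after `epoch`.

    `epoch` is zero-based. A checkpoint is written every `every_n_epochs`
    epochs and always after the final epoch.
    """
    if every_n_epochs < 1:
        raise ValueError("every_n_epochs must be at least 1")
    is_last_epoch = epoch == total_epochs - 1
    return (epoch + 1) % every_n_epochs == 0 or is_last_epoch


# With the default, every epoch is checkpointed.
epochs_saved = [e for e in range(5) if should_checkpoint(e, total_epochs=5)]

# Saving every second epoch still keeps the final one.
sparse_saved = [e for e in range(5) if should_checkpoint(e, 5, every_n_epochs=2)]
```

A less frequent schedule reduces upload time and storage use, at the cost of losing more progress if training is interrupted.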
TensorFlow¶
To save the model after each epoch, pass a custom callback to the callbacks argument of the model.fit function.
import tensorflow as tf


class UbiOpsCallback(tf.keras.callbacks.Callback):
    def __init__(self, bucket_name, context):
        super().__init__()
        # Keep the arguments so that on_epoch_end can use them
        self.bucket_name = bucket_name
        self.context = context
        # TODO: Add initialization code

    def on_epoch_end(self, epoch, logs=None):
        """
        This function is called at the end of each epoch.
        :param epoch: the epoch number
        :param logs: the logs of the epoch
        """
        # TODO: Save model (available as self.model) and upload to UbiOps


def train(training_data, parameters, context):
    """
    This function is called by UbiOps.
    :param training_data: the training data
    :param parameters: the parameters
    :param context: the context
    """
    # TODO: Add TensorFlow training code (it must define `model` and `bucket_name`)
    model.fit(
        training_data,
        callbacks=[UbiOpsCallback(bucket_name, context)]
    )
PyTorch¶
To save the model after each epoch, call the torch.save function at the end of each iteration of the epoch loop.
import torch


def train(training_data, parameters, context):
    """
    This function is called by UbiOps.
    :param training_data: the training data
    :param parameters: the parameters
    :param context: the context
    """
    # TODO: Add PyTorch training code (it must define `model`, `epochs` and `saved_model_name`)
    for epoch in range(epochs):
        # TODO: Add training code
        torch.save(model, saved_model_name)
        # TODO: Upload model to UbiOps
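The same train, save, upload order applies regardless of framework. The sketch below illustrates it without any framework installed: `run_with_checkpoints` and its three callables are hypothetical names introduced for illustration, not part of UbiOps or PyTorch, and the lambdas at the bottom are stand-ins for real training, saving, and uploading code.

```python
def run_with_checkpoints(epochs, train_one_epoch, save_model, upload_checkpoint):
    """Train for `epochs` epochs, saving and uploading after each one.

    All three callables are supplied by the caller:
    - train_one_epoch(epoch): runs one epoch of training
    - save_model(epoch): writes the model to disk and returns the file path
    - upload_checkpoint(path, epoch): uploads the saved file
    """
    uploaded = []
    for epoch in range(epochs):
        train_one_epoch(epoch)
        path = save_model(epoch)
        upload_checkpoint(path, epoch)
        uploaded.append(path)
    return uploaded


# Stand-in callables so the sketch runs without any framework installed.
events = []
paths = run_with_checkpoints(
    epochs=2,
    train_one_epoch=lambda epoch: events.append(("train", epoch)),
    save_model=lambda epoch: f"model_epoch_{epoch}.pt",
    upload_checkpoint=lambda path, epoch: events.append(("upload", path)),
)
```

In a real training script, `save_model` could wrap `torch.save` and `upload_checkpoint` could wrap `ubiops.utils.upload_file`.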
Model Saving¶
The model can be saved in different ways, depending on the framework used.
To upload a saved model to UbiOps alongside the training result, the following code can be used:
import ubiops

# `client` is assumed to be an authenticated ubiops.ApiClient
ubiops.utils.upload_file(
    client=client,
    project_name=project_name,
    file_path=saved_model_name,
    bucket_name=bucket_name,
    file_name=f"deployment_requests/{context['id']}/checkpoints/"
              f"{saved_model_name}_epoch_{epoch}.{saved_model_name.split('.')[-1]}",
)
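The `file_name` expression above keeps the full file name in front of `_epoch_`, so `model.pt` is stored as `model.pt_epoch_3.pt`. As a possible variation, the hypothetical helper `checkpoint_file_name` below (an illustration, not part of the UbiOps client library) splits off the extension first and drops any local directory from the name. The request ID `abc123` is a made-up example value.

```python
import os


def checkpoint_file_name(request_id, saved_model_name, epoch):
    """Build the bucket path for the checkpoint of a given epoch."""
    # Drop any local directory so only the file name ends up in the bucket path
    base_name = os.path.basename(saved_model_name)
    # Split "model.pt" into ("model", ".pt"); the extension may be empty
    stem, extension = os.path.splitext(base_name)
    return (
        f"deployment_requests/{request_id}/checkpoints/"
        f"{stem}_epoch_{epoch}{extension}"
    )


example = checkpoint_file_name("abc123", "model.pt", 3)
no_extension = checkpoint_file_name("abc123", "model", 0)
```

The returned string can be passed as the `file_name` argument of the upload call, with `context['id']` as the request ID.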
TensorFlow¶
TensorFlow models can be saved with the joblib library.
import joblib
joblib.dump(model, saved_model_name)
PyTorch¶
PyTorch models can be saved with the torch.save function.
import torch
torch.save(model, saved_model_name)