Retrain ResNet using PyTorch¶
Download notebook View source code
In this example, we show how to retrain a PyTorch model on UbiOps. In this end-to-end example, we first set-up an Environment
in which our training jobs can run. Then we define a train.py
script that we can apply to our Environment
. The training script imports ResNet with pretrained weights, and retrains that model on the CIFAR-10 dataset. Finally, we benchmark its performance on the test set, and add that as a metric
to our output. Snippets from this workflow can be used to retrain your own models.
Let us first install the UbiOps Python client.
!pip install "ubiops >= 3.15"
1) Set project variables and initialize the UbiOps API Client¶
First, make sure you create an API token with project editor permissions in your UbiOps project and paste it below. Also fill in your corresponding UbiOps project name.
from datetime import datetime
import yaml
import os
import ubiops
dt = datetime.now()
API_TOKEN = 'Token ' # Paste your API token here. Don't forget the `Token` prefix
PROJECT_NAME = '' # Fill in the corresponding UbiOps project name
configuration = ubiops.Configuration(host="https://api.ubiops.com/v2.1")
configuration.api_key['Authorization'] = API_TOKEN
api_client = ubiops.ApiClient(configuration)
core_instance = ubiops.CoreApi(api_client=api_client)
training_instance = ubiops.Training(api_client=api_client)
print(core_instance.service_status())
Set-up a training instance in case you have not done this yet in your project. This action will create a base training deployment, that is used to host training experiments.
training_instance = ubiops.Training(api_client=api_client)
try:
training_instance.initialize(project_name=PROJECT_NAME)
except ubiops.exceptions.ApiException as e:
print(f"The training feature may already have been initialized in your project:\n{e}")
Defining the code environment¶
Our training code needs an environment to run in, with a specific Python language version, and some dependencies, like PyTorch
. You can create and manage environments in your UbiOps project. We create an environment named python3-11-pytorch-retraining
, select Python 3.11 and upload a requirements.txt which contains the relevant dependencies.
The environment can be reused and updated for different training jobs (and deployments!). The details of the environment are visible in the 'environments' tab in the UbiOps UI.
training_environment_dir = 'training_environment'
ENVIRONMENT_NAME = 'python3-11-pytorch-retraining'
%mkdir {training_environment_dir}
%%writefile {training_environment_dir}/requirements.txt
torch==1.13.1
torchvision==0.14.1
import shutil
training_environment_archive = shutil.make_archive(f'{training_environment_dir}', 'zip', '.', f'{training_environment_dir}')
# Create experiment. Your environment is set-up in this step. It may take some time to run.
try:
api_response = core_instance.environments_create(
project_name=PROJECT_NAME,
data=ubiops.EnvironmentCreate(
name=ENVIRONMENT_NAME,
#display_name=ENVIRONMENT_NAME,
base_environment='python3-11',
description='Training environment with Python 3.11 and PyTorch 1.13 for Resnet retraining',
)
)
core_instance.environment_revisions_file_upload(
project_name=PROJECT_NAME,
environment_name=ENVIRONMENT_NAME,
file=training_environment_archive
)
except ubiops.exceptions.ApiException as e:
print(e)
Configure an experiment¶
The basis for model training in UbiOps is an 'Experiment'. An experiment has a fixed code environment and hardware (instance) definition, but it can hold many different 'Runs'. You can create an experiment in the WebApp or use the client library, as we do here.
This bucket will be used to store your training jobs, model artifacts and any other files that are created during the training run.
EXPERIMENT_NAME = 'retrain-resnet-pytorch' # str
BUCKET_NAME = 'default'
try:
experiment = training_instance.experiments_create(
project_name=PROJECT_NAME,
data=ubiops.ExperimentCreate(
instance_type_group_name='4096 MB + 1 vCPU',
description='Retrain the ResNet model on CIFAR-10 data',
name=EXPERIMENT_NAME,
environment=ENVIRONMENT_NAME,
default_bucket= BUCKET_NAME
)
)
except ubiops.exceptions.ApiException as e:
print(e)
Define and start a training run¶
A training job in UbiOps is called a run. To run Python code for training on UbiOps, we need to create a file named train.py
and include our training code here. This code will execute as a single 'Run' as part of an 'Experiment' and uses the code environment and instance type (hardware) as defined with the experiment as shown before.
Let’s take a look at the training script. The train.py
script requires a train()
function, with input parameters training_data
(a file path to your training data) and parameters
(a dictionary that contains parameters of your choice). More detailed information on the training code format can be found in the UbiOps training documentation.
In this example, we will download the CIFAR-10
dataset using the torchvision
package during the training process, so there is no need to upload our own dataset.
Now that we have our environment
and experiment
set-up, it is easy to initiate runs. The RUN_NAME
and RUN_SCRIPT
can easily be tweaked in the next two cells, and sent to the relevant experiment
in the cell after.
RUN_NAME = 'training-run'
RUN_SCRIPT = f'{RUN_NAME}.py'
%%writefile {RUN_SCRIPT}
import json
import os
import torch
import torchvision
import time
import torchvision.transforms as transforms
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
class Net(nn.Module):
def __init__(self):
super().__init__()
#Preload resnet. Supress logs while importing
self.model = models.resnet50(weights='ResNet50_Weights.DEFAULT', progress = False)
self.loss = nn.CrossEntropyLoss()
#Apply our optimizer
self.optimizer = optim.SGD(self.model.parameters(), lr = 0.01, momentum = 0.9)
def forward(self, x, target=None):
x = self.model(x)
if self.training:
loss = self.loss(x, target)
loss.backward()
self.optimizer.step()
self.optimizer.zero_grad()
return x, loss
else:
return x
def train(training_data, parameters, context):
# Check the availability of a GPU (this tutorial focuses on a CPU instance,
# but can be extended to run on a GPU instance)
print(torch.cuda.is_available())
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Get batch size from input parameters
batch_size = parameters['batch_size']
epochs = int(parameters['epochs'])
print(f"Unpacked parameters {parameters}")
# Create data input transformer
transform = transforms.Compose(
[transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
# Select the dataset from torchvision
trainset = torchvision.datasets.CIFAR10(
root='./data',
train=True,
download=True,
transform=transform
)
trainloader = torch.utils.data.DataLoader(
trainset,
batch_size=batch_size,
shuffle=True,
drop_last=True,
num_workers=2
)
testset = torchvision.datasets.CIFAR10(
root='./data',
train=False,
download=True,
transform=transform
)
testloader = torch.utils.data.DataLoader(
dataset=testset,
batch_size=batch_size,
shuffle=False,
drop_last = True,
num_workers=2
)
classes = ('plane', 'car', 'bird', 'cat','deer', 'dog', 'frog', 'horse', 'ship', 'truck')
net = Net()
net.to(device)
print(f"Moved Resnet model to {device}")
print("Starting the model training!")
for epoch in range(epochs): # loop over the dataset multiple times
running_loss = 0.0
for _ , data in enumerate(trainloader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data
inputs = inputs.to(device)
labels = labels.to(device)
_, loss = net(inputs, labels)
# print statistics
running_loss += loss.item()
print(f'Epoch [{epoch + 1}/{epochs}], Loss: {running_loss / len(trainloader):.3f}')
print("Finished model training")
model_path = "./cifar_net.pth"
# Return the trained model
torch.save(net.state_dict(), model_path)
print(f"Saved model to {model_path} ")
print("Evaluating the model performance")
testnet = Net()
testnet.to(device)
testnet.load_state_dict(torch.load(model_path))
testnet.eval()
# Test accuracy
correct = 0
total = 0
# since we're not training, we don't need to calculate the gradients for our outputs
with torch.no_grad():
for data in testloader:
inputs, labels = data
inputs = inputs.to(device)
labels = labels.to(device)
# calculate outputs by running images through the network
outputs = testnet(inputs)
# the class with the highest energy is what we choose as prediction
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
print(f'Accuracy of the retrained Resnet model on all 10000 test images: {100 * correct // total} %')
run_id = context['id']
return {
"artifact": {
"file": "cifar_net.pth",
"bucket": os.environ.get("SYS_DEFAULT_BUCKET", "default"),
"bucket_file": f"{run_id}/cifar_net.pth"
},
"metrics": json.dumps({
"accuracy": 100 * correct // total
})
}
Now we initiate the training run. Do note that each epoch takes around 15 minutes to finish on a 4GB CPU instance. For demonstration purposes, we will run 1 epoch only, but feel free to increase this number if you have the time. The workload is running in the cloud, so there is no need to keep your local machine on.
new_run = training_instance.experiment_runs_create(
project_name=PROJECT_NAME,
experiment_name=EXPERIMENT_NAME,
data=ubiops.ExperimentRunCreate(
name=RUN_NAME,
description='First try!',
training_code= RUN_SCRIPT,
training_data= None,
parameters={
'epochs': 1, # example parameters
"batch_size" : 32
}
),
timeout=14400
)
Analyse the logs while training¶
One way to measure our model performance during training is to check the logs. We can do so in the UI, or by using the relevant API endpoint. To format the the logs in a pretty way, we will use the pprint
library.
import pprint
from datetime import datetime
current_datetime = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S.%fZ")
logs = core_instance.projects_log_list(
project_name = PROJECT_NAME,
data = {
"date_range": -86400, # Get results between current_datetime and 86400 seconds before
"filters": {
"deployment_name": "training-base-deployment",
"deployment_request_id": new_run.id, #
"deployment_version": EXPERIMENT_NAME,
# "system": False # Optional filter to enable/disable system-level logs, see docs: "https://ubiops.com/docs/monitoring/logging/#system-logs"
},
"limit": 100,
"date": current_datetime,
})
logs_body = {log.log for log in logs}
pprint.pprint(logs_body, indent = 1)
Wrapping up¶
So that's it! We have created a set-up where we can retrain ResNet on UbiOps using the PyTorch library. The training script, model artifact and output metric are stored on UbiOps. This creates a proper basis for improving the accuracy of our final custom model.
Let us close the connection to the UbiOps API
core_instance.client_close()