Pachyderm and UbiOps working as one: Train, Retrain, Deploy and Serve an Image Recognition Model

12 May 2021 | Blog, Collaborations

Pachyderm’s data science platform combines a robust data versioning and data lineage engine with end-to-end pipelines. UbiOps delivers a free-to-use SaaS and on-prem platform for rapidly deploying machine learning models without in-depth IT knowledge.

In this notebook we walk through a simple integration between Pachyderm and UbiOps, showing how to train and retrain a model using Pachyderm and deploy it via UbiOps. 

Great, but why?


Data versioning and retraining models on newly added data are easier said than done. One of Pachyderm’s key features is incrementality. If you’ve already trained on a terabyte of data and 1 GB of new data flows into your storage backend, you don’t want to retrain on the entire terabyte. Pachyderm lets you train only on the new data coming in, unless you specifically ask it to retrain on the entire dataset.

Pachyderm is a sophisticated tool built on top of Kubernetes, but Pachyderm Hub lets people use its powerful functionality without having to know the ins and outs of K8s. Pachyderm Hub allows you to store data, update the repository periodically, and trigger the UbiOps pipeline to run. The features and simplicity of Hub pair well with UbiOps, which focuses on ease of use, hiding away unnecessary complexities like container management, Kubernetes cluster deployment, and API creation.

We chose to showcase the integration with an image recognition example because a platform like Pachyderm excels at large volumes of unstructured data such as images and video. Our model predicts the age of the person in an image. We train the model on a large dataset generated from IMDB.

Prerequisites

Setup guide (step by step)

If you want the complete source of this integration, please see the notebook file on GitHub.

1. Give Pachyderm access to UbiOps

Pachyderm needs secure access to UbiOps in order to create a new deployment there. We do this with a Pachyderm secret, which you can use to store credentials and passwords.

Pachyderm secret:



{
   "apiVersion": "v1",
   "kind": "Secret",
   "metadata": {
      "name": "ubiops"
   },
   "type": "Opaque",
   "stringData": {
      "token": "Token token"
   }
}

Add your own token and save the file as `secret.json`. You can read about how to create your own token here. Then run the following command to create the secret:


!pachctl create secret -f secret.json
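If you'd rather not paste the token into the file by hand, a small script can generate `secret.json` for you. This is an illustrative sketch of our own; the `UBIOPS_API_TOKEN` environment variable is a convention we made up here, not something Pachyderm or UbiOps requires:

```python
import json
import os

# Hypothetical convention: the UbiOps API token is exported beforehand,
# e.g. export UBIOPS_API_TOKEN="Token abc123..."
token = os.environ.get("UBIOPS_API_TOKEN", "Token <your-token-here>")

secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "ubiops"},
    "type": "Opaque",
    "stringData": {"token": token},
}

# Write the secret manifest that `pachctl create secret -f` expects
with open("secret.json", "w") as f:
    json.dump(secret, f, indent=3)
```

This keeps the token out of version control as long as `secret.json` itself is gitignored.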

2. Create a pachyderm pipeline

Now we need to create a pipeline in Pachyderm to train the model on the data stored there. Like the secret, the Pachyderm pipeline is defined by a JSON file. Note that the `transform.build.image` field refers to a custom Pachyderm builder image.


{
  "pipeline": {
    "name": "faces_train_model"
  },
  "datum_tries": 1,
  "description": "A pipeline that trains our neural network",
  "transform": {
    "build": {
      "image": "raoulfaselubiops/pachyderm-builder:latest",
      "path": "./"
    },
    "secrets": [
      {
        "name": "ubiops",
        "mount_path": "/opt/ubiops"
      }
    ]
  },
  "input": {
    "pfs": {
      "repo": "faces",
      "glob": "/*"
    }
  }
}

If you then save it to `pachy_source/build_pipeline.json` you can instantiate it using:


!pachctl update pipeline -f pachy_source/build_pipeline.json

Of course, more files are needed for training; you can find them in the notebook. Note that in the pipeline JSON we refer to the Pachyderm secret we created earlier (step 1).
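Because JSON specs are easy to break (a stray comment or trailing comma makes them unparseable), a quick sanity check before handing the file to `pachctl` can save a round trip. This helper is our own addition, not part of the official Pachyderm workflow; the required keys simply mirror the fields used in the spec above:

```python
import json

def check_pipeline_spec(path):
    # json.load raises ValueError on invalid JSON, e.g. stray comments
    # or trailing commas that sometimes sneak into hand-edited specs.
    with open(path) as f:
        spec = json.load(f)
    assert "pipeline" in spec and "name" in spec["pipeline"], "missing pipeline name"
    assert "transform" in spec, "missing transform section"
    assert "input" in spec, "missing input section"
    return spec["pipeline"]["name"]
```

Run it on the file right before `pachctl update pipeline` to fail fast on malformed specs.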

3. Connect to UbiOps 

The next step is to create the Pachyderm connection to UbiOps SaaS, which runs on Google Cloud. For that, we use the token from the mounted secret to connect to UbiOps:


import ubiops

# Read the API token from the Pachyderm secret mounted at /opt/ubiops
with open('/opt/ubiops/token', 'r') as reader:
    API_TOKEN = reader.read()

client = ubiops.ApiClient(ubiops.Configuration(
    api_key={'Authorization': API_TOKEN},
    host='https://api.ubiops.com/v2.1'
))
api = ubiops.CoreApi(client)
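One gotcha worth guarding against: a mounted secret file often ends with a trailing newline, and a stray `\n` inside the `Authorization` header will cause authentication failures that are hard to diagnose. A defensive reader (our own sketch, not from the original notebook) looks like this:

```python
def read_api_token(path="/opt/ubiops/token"):
    # Strip surrounding whitespace/newlines that commonly end up in
    # mounted secret files; they would corrupt the Authorization header.
    with open(path) as reader:
        token = reader.read().strip()
    if not token.startswith("Token "):
        raise ValueError("expected the token in the form 'Token <key>'")
    return token
```

The format check also catches the case where the placeholder value from `secret.json` was never replaced.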

4. UbiOps deployment

Displaying the entire UbiOps deployment source code is out of scope for this article; you can find the full file in the GitHub repository. It’s worth highlighting that an entry in the UbiOps.yaml installs some Linux dependencies without having to create a Dockerfile. Read more about installing Linux dependencies in UbiOps here.

UbiOps.yaml:


apt:
 packages:
   - cmake
   - protobuf-compiler
   - build-essential
   - python3.8-dev

5. Uploading data to Pachyderm

Uploading data to Pachyderm is easy. Take a look at the notebook for a script that downloads the IMDB data. Assuming you have the IMDB data in `data/imdb_crop/`:


!pachctl put file -p=30 --progress=false -r faces@master:data/imdb_crop/00 -f data/imdb_crop/00
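The IMDB dump is split across numbered subdirectories (`00`, `01`, …), so uploading it chunk by chunk keeps individual `put file` calls manageable. A small helper can generate one command per chunk; this is our own sketch, where the repo name `faces` comes from the pipeline spec above and `master` is assumed as Pachyderm's default branch:

```python
from pathlib import Path

def put_file_commands(local_root="data/imdb_crop", repo="faces", branch="master"):
    # Emit one `pachctl put file` invocation per top-level chunk directory,
    # mirroring the local directory layout inside the Pachyderm repo.
    cmds = []
    for chunk in sorted(Path(local_root).iterdir()):
        if chunk.is_dir():
            cmds.append(
                f"pachctl put file -p=30 --progress=false -r "
                f"{repo}@{branch}:{local_root}/{chunk.name} -f {local_root}/{chunk.name}"
            )
    return cmds
```

You can print the list and pipe it to a shell, or run each command from the notebook with `!`.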

Once the data is uploaded, our Pachyderm pipeline should immediately start training the model and serving it on UbiOps. Time to sit back and relax!

Wrap up 

While this integration is exemplified with an image recognition model, the integration between Pachyderm and UbiOps can be used for a variety of use cases in machine learning. UbiOps is especially useful for frequent and relatively small requests, while Pachyderm Hub allows you to efficiently and easily store and process image or video data. 

In case of any questions or remarks, please join the UbiOps Slack or the Pachyderm Slack. To download the full notebook, click here.