CI/CD for Jupyter Notebooks

Continuous Integration and Continuous Delivery practices with Jupyter notebooks and Google Cloud.

A long, long time ago, when mammoths were still peacefully eating grass, the software industry arrived at some best practices, in particular Continuous Integration (CI) and Continuous Delivery/Deployment (CD). In fact, CI/CD is not just a collection of practices. It’s more of a culture, a set of operating principles that lets teams focus on product excellence rather than on operational work.

The AI and ML space is still developing its own versions of these best practices. This article demonstrates some use cases and considerations around Jupyter notebooks that you may find helpful.

Reproducible Jupyter Notebooks

The major difficulty in building CI/CD for Jupyter notebooks is that a notebook will generally only work in the environment where it was created. There are many aspects to this problem, but the main one is the notebook's dependencies.

So, the first step in building CI/CD for Jupyter notebooks is to make sure that we can reliably re-run the notebooks.

The topic of Jupyter notebook reproducibility is big enough to deserve a dedicated post.

But the general idea is that Jupyter stores information about its Deep Learning environment, as well as a list of additional dependencies, in the notebook metadata. (This functionality currently works only in the test Deep Learning images of Google Cloud AI.)
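To make the metadata idea concrete, here is a minimal sketch of reading such a record back out of a notebook file. The top-level "metadata" object is part of the .ipynb format; the "environment" key and the values inside it are assumptions made up for illustration, not the documented layout used by the Deep Learning images.

```python
import json
import tempfile
from pathlib import Path


def read_environment(notebook_path):
    """Return the environment record stored in the notebook metadata, if any."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    # "metadata" is a standard top-level key of an .ipynb file.
    # The "environment" key is a hypothetical name used for this sketch.
    return notebook.get("metadata", {}).get("environment", {})


# A tiny stand-in notebook with placeholder values (not real image names).
demo_notebook = {
    "cells": [],
    "metadata": {
        "environment": {"name": "example-env", "dependencies": ["example-package"]}
    },
    "nbformat": 4,
    "nbformat_minor": 2,
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.ipynb"
    demo_path.write_text(json.dumps(demo_notebook), encoding="utf-8")
    env = read_environment(demo_path)

print(env)
```

Because an .ipynb file is plain JSON, a CI tool can inspect this record without starting a Jupyter server.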

Continuous Integration of Jupyter Notebooks

Let’s assume that you store your notebooks in some code repository, like git.

Then a natural wish is to have some sort of CI which, at the very least, validates that all committed notebooks are “functional” and don’t have syntax errors.
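The sample project described in this article validates notebooks by actually running them. As a cheaper complement, the "no syntax errors" part alone can be sketched with the standard library. This is an illustrative pre-check under simplifying assumptions (IPython magics and shell escapes are simply skipped), not what the sample project does.

```python
import ast


def code_cell_sources(notebook):
    """Yield the source text of every code cell in a parsed notebook dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # The notebook format allows source as a string or a list of lines.
            yield source if isinstance(source, str) else "".join(source)


def syntax_errors(notebook):
    """Return (code_cell_index, message) for each cell that fails to parse."""
    errors = []
    for index, source in enumerate(code_cell_sources(notebook)):
        if source.lstrip().startswith("%%"):
            # Cell magics (e.g. %%bash) mean the cell is not plain Python.
            continue
        # Drop line magics and shell escapes, which are IPython-only syntax.
        lines = [
            line
            for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            ast.parse("\n".join(lines))
        except SyntaxError as exc:
            errors.append((index, exc.msg))
    return errors


demo_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["%matplotlib inline\n", "x = 1 + 1\n"]},
        {"cell_type": "code", "source": "def broken(:\n    pass\n"},
    ]
}

found = syntax_errors(demo_notebook)
print(found)
```

The index in each result counts code cells only, so the broken cell above is reported as cell 1. A check like this catches typos quickly, but only a full run can tell you whether a notebook is really functional.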

You can find an example of such a CI in this sample project.

The major piece of this project is a Cloud Build configuration file.

This build configuration has just 3 steps.

The first two steps clone the repository and check out the commit that triggered the build:

- name: 'gcr.io/cloud-builders/git'
  id: "clone"
  args: ['clone', '--recurse-submodules', '--branch', 'repro_nb_based_pipeline', 'https://github.com/gclouduniverse/notebooks-ci-showcase']
- name: 'gcr.io/cloud-builders/git'
  id: 'checkout'
  args: ['checkout', '$COMMIT_SHA']
  dir: 'notebooks-ci-showcase'

The last step installs the gcloud-notebook-training tool and then submits a Google Cloud AI Training Job for each of the notebooks from the commit.

- name: 'docker.io/library/python:3.7'
  id: 'train'
  args: ['bash', '-c', 'pip install gcloud-notebook-training && git show --name-only $COMMIT_SHA | grep -i .ipynb | xargs -I {} gcloud-notebook-training --input-notebook {}']
  dir: 'notebooks-ci-showcase'
  timeout: 7200s

We are using a Python 3.7 Docker container here, along with the gcloud-notebook-training tool.
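The shell pipeline in that step picks the changed notebooks out of the `git show --name-only` output. The same selection can be sketched in Python, which makes the matching rule easier to see and tighten. The function name and the sample output are made up for illustration; the sketch relies on git indenting commit message lines in its default output format, while file names are listed unindented.

```python
def changed_notebooks(git_show_output):
    """Pick notebook paths out of `git show --name-only` output."""
    paths = []
    for line in git_show_output.splitlines():
        # Indented lines belong to the commit message, not the file list.
        if not line or line[0].isspace():
            continue
        # Require a real ".ipynb" suffix, compared case-insensitively.
        if line.strip().lower().endswith(".ipynb"):
            paths.append(line.strip())
    return paths


# Illustrative output; the commit hash, author, and paths are placeholders.
sample_output = """commit 0123abc
Author: Jane Doe <jane@example.com>
Date:   Mon Jan 1 00:00:00 2020 +0000

    Update train.ipynb

notebooks/train.ipynb
notebooks/Report.IPYNB
README.md
"""

selected = changed_notebooks(sample_output)
print(selected)
```

Unlike a bare `grep -i .ipynb`, this ignores a commit message that merely mentions a notebook name, and it only accepts paths that actually end in ".ipynb".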

gcloud-notebook-training is a simple tool which accepts the following parameters:

gcloud-notebook-training [-h] --input-notebook INPUT_NOTEBOOK
              [--project-id PROJECT_ID]
              [--output-notebook OUTPUT_NOTEBOOK]
              [--job-id JOB_ID]
              [--region REGION]
              [--worker-machine-type WORKER_MACHINE_TYPE]
              [--bucket-name BUCKET_NAME]
              [--max-running-time MAX_RUNNING_TIME]
              [--container-uri CONTAINER_URI]
              [--accelerator-type ACCELERATOR_TYPE]

You can find the sources of this tool as well as its description here.
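If you drive the tool from a script rather than from a shell one-liner, the command line can be assembled from the options in the usage listing above. The helper below is a hypothetical convenience wrapper, not part of the tool; the option names come from the listing, and the project and region values in the example are placeholders.

```python
# Option names taken from the usage listing, written as Python identifiers.
ALLOWED_OPTIONS = {
    "project_id",
    "output_notebook",
    "job_id",
    "region",
    "worker_machine_type",
    "bucket_name",
    "max_running_time",
    "container_uri",
    "accelerator_type",
}


def build_command(input_notebook, **options):
    """Build an argument list for gcloud-notebook-training."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError("unknown options: " + ", ".join(sorted(unknown)))
    command = ["gcloud-notebook-training", "--input-notebook", input_notebook]
    for name, value in options.items():
        if value is not None:
            # project_id becomes --project-id, and so on.
            command += ["--" + name.replace("_", "-"), str(value)]
    return command


command = build_command(
    "train.ipynb", project_id="my-project", region="us-central1", job_id=None
)
print(command)
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` directly, with no shell quoting to worry about.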

gcloud-notebook-training can read meta-information about the notebook's environment from the notebook itself. This allows the tool to submit the training job with the same DL environment that was used to create and run the notebook.

The last step is to hook this Cloud Build definition up to our git repository. This can be done easily on the GCP Cloud Build page.

Continuous Delivery of Jupyter Notebooks

Let’s consider a different scenario. We have notebooks which define different stages of some ML pipeline. We can have separate notebooks for data cleaning, data processing, training the model, testing the model and so on.

And again, we want to run the entire chain of notebooks every time someone changes one of them. We may also want to run the entire chain daily, on a schedule.

This example demonstrates a Continuous Delivery solution based on Jupyter notebooks.

We assume here that each notebook in the chain is “self-contained”, meaning that it can run independently and that it downloads its input from, and uploads its output to, some external storage. In our example that storage is GCS.

As in the previous example, we submit our notebooks as Cloud AI Training Jobs. The only difference is that we submit them sequentially and then, at the end of the flow, deploy the result to the Cloud AI Prediction service (inference).

- name: 'gcr.io/deeplearning-platform-release/base-cpu:m39'
  id: 'deploy'
  dir: 'notebooks-ci-showcase'
  args: ['$COMMIT_SHA']
  entrypoint: './deploy_model_for_inference.sh'

The core part of deploy_model_for_inference.sh is the following call:

gcloud ai-platform versions create "${VERSION_NAME}" \
  --model "${MODEL_NAME}" \
  --origin "${GCS_MODEL_DIR}" \
  --runtime-version=1.14 \
  --framework "${FRAMEWORK}" \
  --python-version=3.5 \
  --project "${PROJECT_ID}"

The above example contains only two simple notebooks in the chain, but it demonstrates how Continuous Delivery can be built on top of Jupyter notebooks.

Notebook-based Continuous Delivery may not be the optimal solution for big projects. Kubeflow Pipelines or TensorFlow Extended pipelines may be better solutions in many (or even most) cases.

However, in many scenarios notebook-based Continuous Delivery has its own advantages, mainly due to its simplicity.
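The sequential part of this flow can be sketched as a small driver that submits the notebooks one at a time and stops at the first failure. This is an illustration, not the sample project's code: the notebook names are hypothetical, and it assumes the tool blocks until the job finishes and exits with a non-zero status when the job fails.

```python
import subprocess


def run_chain(notebooks, run=subprocess.run):
    """Submit notebooks in order; stop as soon as one of them fails."""
    completed = []
    for notebook in notebooks:
        command = ["gcloud-notebook-training", "--input-notebook", notebook]
        # With check=True, subprocess.run raises CalledProcessError on a
        # non-zero exit status, so later stages never start after a failure.
        run(command, check=True)
        completed.append(notebook)
    return completed


def dry_run(command, check=True):
    """Stand-in runner that prints the command instead of executing it."""
    print("would run:", " ".join(command))


# Hypothetical stage names; pass run=subprocess.run (the default) for real use.
finished = run_chain(["01_clean_data.ipynb", "02_train_model.ipynb"], run=dry_run)
```

Passing the runner in as a parameter keeps the ordering logic testable without touching Google Cloud. Each stage still exchanges data only through external storage, as described above.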