AI & Machine Learning

Reduce the costs of ML workflows with preemptible VMs and GPUs

kubeflow.jpg

It's becoming increasingly straightforward and useful to make machine learning (ML) part of your business processes, but running ML jobs can be costly. For example, you may have periodically scheduled—often daily—training or tuning runs, or regularly run large-scale hyperparameter tuning experiments. These runs, however, often don’t need to finish within a fixed time window. If you have some flexibility in your timing and availability, preemptible VMs with GPUs can be a useful tactic for significantly lowering costs.

In this post, we’ll show you how to use preemptible GPU-provisioned VMs when running Kubeflow Pipelines jobs, to do exactly that. We’ll also look at how you can use Stackdriver Monitoring to inspect logs for both current and terminated pipeline operations.

Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. Because of this, preemptible VMs are priced lower than standard Compute Engine VMs. With Google Kubernetes Engine (GKE), it is easy to set up a cluster or node pool that uses preemptible VMs. You can set up such a node pool with GPUs attached to the preemptible instances, which work the same way as regular GPU-enabled nodes, but the GPUs persist only for the life of the instance.

Kubeflow is an open-source project dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable. Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning workflows based on Docker containers.

If you’re running Kubeflow on GKE, it is now easy to define and run Kubeflow Pipelines in which one or more pipeline steps (components) run on preemptible nodes, reducing the cost of running a job. To get correct results when using preemptible VMs, the steps that you identify as preemptible should either be idempotent (that is, if you run a step multiple times, it will have the same result), or should checkpoint work so that the step can pick up where it left off if it gets interrupted.

For example, a copy of a Google Cloud Storage directory will have the same results if it’s interrupted and run again, assuming the source directory is unchanging. An operation to train a machine learning model (e.g., a TensorFlow training run) will typically be set up to checkpoint periodically, so if the training is interrupted by preemption, it can just pick up where it left off when the step is restarted. Most ML frameworks make it easy to support checkpointing, so if your Kubeflow pipeline includes model training steps, these are great candidates for running on preemptible GPU-enabled nodes.
A Kubeflow Pipelines job.png
A Kubeflow Pipelines job that is using preemptible VMs for its training. When a training step is terminated due to preemption, it is restarted, and picks up where it left off using checkpoint information.

When you’re running a pipeline and a cluster node is preempted, any pods running on that node will be terminated as well. GCP’s Stackdriver logging makes it easy to look at the logs from these terminated pods to find out more information.

For the rest of this post, we’ll look at how to use preemptible VMs with Kubeflow Pipelines, and how to inspect pipeline steps using Stackdriver.

Set up a preemptible GPU node pool in your GKE cluster
You can set up a preemptible, GPU-enabled node pool for your cluster by running a command similar to what we have below. Just edit the command with your cluster name and zone, and adjust the accelerator type and count according to your requirements. As you can see below, you can also define the node pool to autoscale based on current workloads. (Note: You may need to increase your GPU quota before running this command.)

  gcloud container node-pools create preemptible-gpu-pool \
    --cluster=<your-cluster-name> \
    --zone <your-cluster-zone> \
    --enable-autoscaling --max-nodes=4 --min-nodes=1 \
    --machine-type n1-highmem-8 \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --scopes cloud-platform --verbosity error \
    --accelerator=type=nvidia-tesla-k80,count=4

You can also set up a node pool via the Cloud Console.

Defining a Kubeflow pipeline that uses the preemptible GKE nodes
When you’re defining a Kubeflow pipeline, you can indicate that a given step should run on a preemptible node by modifying the op like this:

  your_pipelines_op.apply(gcp.use_preemptible_nodepool())

See the documentation for details—if you changed the node taint from the above when creating the node pool, pass the same node toleration to the use_preemptible_nodepool() call.

You’ll presumably also want to retry the step some number of times if the node is preempted. You can do this with the following command, which specifies five retries. This annotation also specifies that the op should run on a node with four GPUs available.

  your_pipelines_op.set_gpu_limit(4).apply(gcp.use_preemptible_nodepool()).set_retry(5)

An example: making model training cheaper with preemptible nodes
Let’s look at a concrete example. The Kubeflow pipeline from this codelab is a good candidate for using preemptible steps. This pipeline trains a Tensor2Tensor model on GitHub issue data, learns to predict issue titles from issue bodies, deploys the trained model for serving, and then deploys a webapp to get predictions from the model. It uses a TensorFlow model architecture that requires GPUs to achieve reasonable performance, and—depending upon the specific configuration—can run for quite a long time. 

We thought it would be useful to modify this pipeline to use the new support for preemptible VMs for the training. (Here is the original pipeline definition.) To do this, we first needed to refactor the pipeline, to separate a Google Cloud Storage (GCS in the code below) copy action from the model training activity. In the original version of this pipeline, a copy of an initial TensorFlow checkpoint was done as part of the training component actions. We needed to refactor this for correctness: if the training is preempted and needs to be restarted, we don’t want to re-copy, which would overwrite the current checkpoint files.

We also created reusable component specifications for the two Google Cloud Storage copy and TensorFlow training pipeline steps, rather than defining them as part of the pipeline definition. A reusable component is a pre-implemented, standalone component that is easy to add as a step in any pipeline, and makes the pipeline definition simpler. The component specification makes it easy to add static type checking of inputs and outputs. You can see these component definition files here and here.

Here is the relevant part of the new pipeline definition:

  import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp

...

copydata_op = comp.load_component_from_url(
  'https://raw.githubusercontent.com/amygdala/kubeflow-examples/preempt/github_issue_summarization/pipelines/components/t2t/datacopy_component.yaml'
  )

train_op = comp.load_component_from_url(
  'https://raw.githubusercontent.com/amygdala/kubeflow-examples/preempt/github_issue_summarization/pipelines/components/t2t/train_component.yaml'
  )

@dsl.pipeline(
  name='Github issue summarization',
  description='Demonstrate Tensor2Tensor-based training and TF-Serving'
)
def gh_summ(
  train_steps=2019300,
  project='YOUR_PROJECT_HERE',
  github_token='YOUR_GITHUB_TOKEN_HERE',
  working_dir='YOUR_GCS_DIR_HERE',
  checkpoint_dir='gs://aju-dev-demos-codelabs/kubecon/model_output_tbase.bak2019000',
  deploy_webapp='true',
  data_dir='gs://aju-dev-demos-codelabs/kubecon/t2t_data_gh_all/'
  ):

  copydata = copydata_op(
    working_dir=working_dir,
    data_dir=data_dir,
    checkpoint_dir=checkpoint_dir,
    model_dir='%s/%s/model_output' % (working_dir, '{{workflow.name}}'),
    action=COPY_ACTION
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))

  train = train_op(
    working_dir=working_dir,
    data_dir=data_dir,
    checkpoint_dir=checkpoint_dir,
    model_dir='%s/%s/model_output' % (working_dir, '{{workflow.name}}'),
    action=TRAIN_ACTION, train_steps=train_steps,
    deploy_webapp=deploy_webapp
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))
    
  train.after(copydata)
  train.set_gpu_limit(4).apply(gcp.use_preemptible_nodepool()).set_retry(5)
  train.set_memory_limit('48G')    
  
  ...

We defined the copydata and train steps using the component definitions, in this case loaded from URLs. (While not shown here, a Github-based component URL can include a specific git commit hash, thus supporting component version control—here’s an example.)

We’ve annotated the training op to run on a preemptible GPU-enabled node, and to retry five times—you would want to increase this number for a long training job. You can see the full pipeline definition here.

(Side note: Using preemptible VMs for the copy step would have also worked, since if it is interrupted, it can be rerun without changing the result.)

Preemptible pipelines in action
When the pipeline above is run, its training step may be preempted and restarted. If this happens, it will look like this in the Kubeflow Pipelines dashboard UI:

pipeline with a preemptible training step.png
A pipeline with a preemptible training step that has been restarted two times.

The restarted training step picks up where it left off, using its most recent saved checkpoint. In this screenshot, we’re using the Pipelines UI to look at the logs for the running training pod, train(2).

We can look at the logs for a pod terminated by node preemption using Stackdriver.
pod for a pipeline step.png
If the pod for a pipeline step has been deleted, a link is provided to look at the logs in Stackdriver.

Clicking the Stackdriver link opens a window that brings up the Stackdriver Log Viewer in the Cloud Console, and sets a filter that selects the output for that pod.

Stackdriver link in the Kubeflow Pipelines UI.png
Clicking the "Stackdriver’"link in the Kubeflow Pipelines UI takes you to the relevant logs in the Cloud Console.

At some later point, the training run completes—in the figure below, after three premptions and retries—and the remainder of the pipeline runs.

A Kubeflow Pipelines job.png
The full pipeline run, with training completed after three preemptions.

What’s next?
This post detailed how to reduce costs by using lower-priced preemptible VMs for your Kubeflow pipeline jobs. As we’ve seen, preemptible VMs are a great fit for regularly run model training or tuning jobs with flexible completion times, or for large-scale HP tuning explorations. To use preemptible VMs for a given pipeline, you only need to add an annotation to the relevant pipeline step(s). As we showed in our example, you can even define your node pools to autoscale so that you don’t pay for idle instances.

To learn more about Kubeflow, including Kubeflow Pipelines, and to try it out yourself, the Kubeflow documentation and examples repo are good starting points. You might also be interested in this recent Kubeflow Community meeting presentation on what's new in the Kubeflow 0.6 release.

It’s also worth trying this “lightweight” Kubeflow Pipelines deployer, which allows people who aren’t a GCP Project-level admin to deploy Kubeflow Pipelines onto an existing cluster. (Here are the details for setting up CloudSQL and Google Cloud Storage.)

In an upcoming post, we’ll show several different ways to deploy a Kubeflow pipeline remotely, from outside a Kubeflow cluster.