With Kubeflow 1.0, run ML workflows on Anthos across environments

March 2, 2020

Jeremy Lewi

Software Engineer

Abhishek Gupta

Engineering Manager

Google started the open-source Kubeflow Project with the goal of making Kubernetes the best way to run machine learning (ML) workloads in production. Today, Kubeflow 1.0 was released.

Kubeflow helps companies standardize on a common infrastructure across software development and machine learning, leveraging open-source data science and cloud-native ecosystems for every step of the machine learning lifecycle. With the support of a robust contributor community, Kubeflow provides a Kubernetes-native platform for developing, orchestrating, deploying, and running scalable and portable ML workloads.

Using Kubeflow on Google Cloud's Anthos platform lets teams run these machine-learning workflows in hybrid and multi-cloud environments while taking advantage of Google Kubernetes Engine's (GKE) enterprise-grade security, autoscaling, logging, and identity features.

Barton Rhodes, Senior Machine Learning Engineer at DaVita, and an early user of Kubeflow on Anthos, said the enterprise features introduced in Kubeflow 1.0 will make a big difference for his organization:

"Having used Kubeflow (since 0.1) as a development foundation for a platform of several teams of data scientists needing to operate in hybrid-cloud environments, it has been a pleasure and an inspiration to see the project mature. When so much of the ethics and impacts of machine learning come down to the details of implementation, operations, safety, and reproducibility for the resulting artifacts, open source allows the broader community to build and tackle these challenges on top of shared foundations. With this release and exciting new features like multi-user isolation, workload identity, and KFServing, it is that much easier to introduce Kubeflow or its individual resources into the enterprise."

The blog post introducing Kubeflow 1.0 provides a technical deep dive into the core set of applications included in the open-source release. In this post, we’ll take a closer look at the advantages of using Kubeflow 1.0 on Anthos for the enterprise.

Security

For data scientists to be productive, they need easy and secure access to UIs like the Kubeflow dashboard, Jupyter UI, and TensorBoard.

When you deploy Kubeflow on Anthos, you can secure it with Identity-Aware Proxy (IAP), Google Cloud's zero trust access solution (also known as BeyondCorp). Using IAP, you can restrict access to Kubeflow based on IP address (for example, limiting access to your corporate network), device attributes (for example, ensuring Kubeflow is only accessed from up-to-date devices), or both.

Autoscaling

When deployed on Anthos, Kubeflow takes advantage of GKE autoscaling and node auto-provisioning to right-size your clusters based on your workloads. If the existing node pools have insufficient resources to schedule pending workloads, node auto-provisioning will automatically create new ones. For example, node auto-provisioning will automatically add a GPU node pool when a user requests a GPU. Autoscaling can also add more VMs to existing node pools if there’s insufficient capacity to schedule pending pods.
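To make the GPU case concrete, here is a minimal sketch of the kind of pod spec that would leave a pod pending until a GPU node exists. It builds the manifest as a plain Python dictionary. The helper name `gpu_pod_manifest`, the container name, and the image are hypothetical; `nvidia.com/gpu` is the standard Kubernetes resource name for NVIDIA GPUs. Whether node auto-provisioning reacts to it depends on how your cluster's auto-provisioning limits are configured.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests `gpus` NVIDIA GPUs.

    All names here are illustrative; only the `nvidia.com/gpu` resource
    key is a fixed Kubernetes convention.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",  # hypothetical container name
                    "image": image,
                    # GPUs are requested through resource limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON, which kubectl also accepts.
    manifest = gpu_pod_manifest("gpu-train", "example.com/trainer:latest")
    print(json.dumps(manifest, indent=2))
```

If no existing node pool can satisfy the GPU limit, the pod stays pending, which is the signal the autoscaler acts on.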

Logging

GKE has direct integration with Cloud Logging, ensuring that the logs from all of your workloads are preserved and easily searchable. As this MNIST example shows, by using a query like the one below, you can fetch the logs for one of the pods in a distributed TensorFlow job by filtering based on the pod label.
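As a rough sketch of what such a query can look like, the helper below assembles a Cloud Logging filter string from a cluster name, a namespace, and a set of pod labels. The function name and the example label keys (`tf-job-name`, `tf-replica-type`, `tf-replica-index`) are assumptions for illustration. So is the `labels."k8s-pod/<key>"` path, which is how pod labels typically appear on GKE log entries but can differ depending on the logging agent version.

```python
def tfjob_pod_log_filter(cluster: str, namespace: str, pod_labels: dict) -> str:
    """Build a Cloud Logging filter that selects container logs by pod label.

    Clauses on separate lines are implicitly ANDed by Cloud Logging.
    """
    clauses = [
        'resource.type="k8s_container"',
        f'resource.labels.cluster_name="{cluster}"',
        f'resource.labels.namespace_name="{namespace}"',
    ]
    # Sort the labels so the generated filter is deterministic.
    for key, value in sorted(pod_labels.items()):
        # Assumed label path; adjust it to match your log entries.
        clauses.append(f'labels."k8s-pod/{key}"="{value}"')
    return "\n".join(clauses)


if __name__ == "__main__":
    # Hypothetical cluster, namespace, and TFJob label values.
    print(
        tfjob_pod_log_filter(
            cluster="my-kubeflow-cluster",
            namespace="kubeflow-user",
            pod_labels={
                "tf-job-name": "mnist-train",
                "tf-replica-type": "worker",
                "tf-replica-index": "0",
            },
        )
    )
```

You can paste the resulting string into the Logs Explorer or pass it to a client that accepts Cloud Logging filters.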

Cloud Logging's integration with BigQuery makes it easier to begin collecting the metrics you need to evaluate performance. If your application emits logs as JSON entries, they will be indexed and searchable in Cloud Logging. You can then use Cloud Logging's export functionality to send them to Cloud Storage or BigQuery for analysis.

Combining BigQuery logging with Kubeflow notebooks can help you analyze model performance. This GitHub notebook illustrates how the Kubeflow project is using this combination to measure the performance of models that automatically classify Kubeflow issues. Using pandas-gbq, we can more easily generate pandas DataFrames based on SQL queries, then analyze and plot results in our notebooks. Below is a snippet illustrating how you can log predictions from Python.
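A minimal sketch of this kind of logging follows. It assumes a hypothetical `JsonFormatter` class, a hypothetical `build_logger` helper, and made-up field names (`repo`, `issue_num`, `predictions`); none of these names come from the Kubeflow codebase. It relies only on Python's standard library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that emits each record as one JSON object."""

    # Attributes every LogRecord carries; anything else came from `extra=`.
    _STANDARD = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
        "message",
        "asctime",
    }

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "message": record.getMessage(),
            "severity": record.levelname,
            "logger": record.name,
        }
        # Copy the caller-supplied extra fields into the JSON payload.
        for key, value in record.__dict__.items():
            if key not in self._STANDARD:
                entry[key] = value
        # default=str keeps non-JSON types (e.g. datetimes) from raising.
        return json.dumps(entry, default=str)


def build_logger(stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("predictions")
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]  # replace handlers so reruns don't duplicate output
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    # Illustrative values only.
    log.info(
        "Label prediction",
        extra={
            "repo": "example-org/example-repo",
            "issue_num": 1,
            "predictions": {"kind/bug": 0.9},
        },
    )
```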

Here we’re using Python's standard logging module with a custom formatter to emit the logs as serialized JSON. The structure is preserved when the logs are ingested into Cloud Logging and then exported to BigQuery, and we can search based on the extra fields that are provided.
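Once the logs land in BigQuery, searching on those extra fields might look like the sketch below. The table name, the `repo` and `issue_num` field names, and both helper functions are hypothetical. The sketch assumes the exported JSON fields appear under a `jsonPayload` record, which is how Cloud Logging exports structured entries. `pandas_gbq.read_gbq` is imported lazily so the query builder can be used without the library installed.

```python
def build_predictions_query(table: str, repo: str, limit: int = 100) -> str:
    """Build a SQL query over exported log entries for one repository.

    Values are interpolated directly for readability; for untrusted input,
    prefer BigQuery query parameters.
    """
    return (
        "SELECT timestamp, jsonPayload.issue_num AS issue_num, "
        "jsonPayload.message AS message\n"
        f"FROM `{table}`\n"
        f"WHERE jsonPayload.repo = '{repo}'\n"
        "ORDER BY timestamp DESC\n"
        f"LIMIT {limit}"
    )


def load_predictions(project_id: str, table: str, repo: str):
    """Run the query and return a pandas DataFrame (requires pandas-gbq)."""
    import pandas_gbq  # imported lazily; needs credentials and network access

    return pandas_gbq.read_gbq(
        build_predictions_query(table, repo), project_id=project_id
    )


if __name__ == "__main__":
    # Hypothetical dataset and table produced by a Cloud Logging export sink.
    print(build_predictions_query("my-project.my_logs.stderr", "example-org/example-repo"))
```

In a notebook, the DataFrame returned by `load_predictions` can then be analyzed and plotted like any other.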

Workload Identity

On Anthos, Kubeflow uses Workload Identity to help seamlessly integrate your AI workloads running on GKE with Google Cloud services. When you create a Kubeflow namespace using Kubeflow's profile controller, you can select a Google Cloud service account to bind to Kubernetes service accounts in the resulting namespace. You can then run pods using those Kubernetes service accounts to access Google Cloud services like Cloud Storage and BigQuery without requiring additional credentials.

The MNIST example mentioned above relies on workload identity to let your Jupyter notebooks, TFJobs, and Kaniko Jobs talk to Cloud Storage.
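To show what the binding amounts to, the sketch below builds the two pieces that GKE Workload Identity relies on: the `iam.gke.io/gcp-service-account` annotation on the Kubernetes service account, and the IAM member string that is granted `roles/iam.workloadIdentityUser` on the Google Cloud service account. The helper names, the project, the namespace, and the `default-editor` account name are assumptions for illustration. Kubeflow's profile controller handles this setup for you, so you would not normally write it by hand.

```python
def annotated_service_account(
    ksa_name: str, namespace: str, gcp_service_account: str
) -> dict:
    """Build a Kubernetes ServiceAccount manifest bound to a Google Cloud
    service account through the Workload Identity annotation."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": ksa_name,
            "namespace": namespace,
            "annotations": {
                "iam.gke.io/gcp-service-account": gcp_service_account,
            },
        },
    }


def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Return the IAM member that must hold roles/iam.workloadIdentityUser
    on the Google Cloud service account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


if __name__ == "__main__":
    # Hypothetical project, namespace, and account names.
    print(
        annotated_service_account(
            "default-editor",
            "kubeflow-user",
            "kubeflow-user@my-project.iam.gserviceaccount.com",
        )
    )
    print(workload_identity_member("my-project", "kubeflow-user", "default-editor"))
```

Pods that run as the annotated Kubernetes service account can then use Application Default Credentials in Google Cloud client libraries without mounting key files.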

What's next

Kubeflow 1.0 is just the beginning. We’re working on additional features that will help you be more secure and efficient. Here's what you can look forward to in upcoming releases:

Support for running ML workloads on-prem using Anthos
Using Katib and Batch on GKE to run large-scale hyperparameter tuning jobs
A solution for preventing data exfiltration by deploying Kubeflow with private GKE and VPC Service Controls

If you want to hear more about this release, check out the Kubeflow 1.0 interview on the Kubernetes Podcast from Google.

Get started

To get started with Kubeflow on Anthos, check out this tutorial. It walks through every step needed to deploy Kubeflow on Anthos GKE and then run the MNIST example end to end.
