Understanding Kubeflow pipelines and components
This document describes the concepts required to understand the machine learning (ML) pipelines and components that are available on AI Hub.
About Kubeflow and the Kubeflow Pipelines platform
Kubeflow is an open source toolkit for running ML workloads on Kubernetes. Kubeflow Pipelines is a component of Kubeflow that provides a platform for building and deploying ML workflows, called pipelines. Pipelines are built from self-contained sets of code called pipeline components.
Understanding pipeline components
The pipeline components on AI Hub are self-contained sets of code that perform one step in the pipeline's workflow, such as data preprocessing, data transformation, model training, and so on. You can use the components on AI Hub to build a new pipeline that meets the requirements of your AI system.
Components are composed of a set of input parameters, a set of outputs, and the location of a container image. A component's container image is a package that includes the component's executable code and a definition of the environment that the code runs in.
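For illustration, here is a minimal sketch of a component definition, assuming the Kubeflow Pipelines SDK v1. The component name, parameters, image path, and script are hypothetical placeholders; a component shared on AI Hub would supply its own specification.

```python
import kfp.components as comp

# A minimal, hypothetical component specification: one input parameter,
# one output, and the location of the container image that runs the code.
preprocess_op = comp.load_component_from_text("""
name: Preprocess
description: Prepares raw data for training.
inputs:
- {name: input_path, type: String}
outputs:
- {name: output_path, type: String}
implementation:
  container:
    image: gcr.io/example-project/preprocess:latest   # hypothetical image
    command: [python3, /app/preprocess.py]
    args:
    - --input
    - {inputValue: input_path}
    - --output
    - {outputPath: output_path}
""")
```

The `inputValue` and `outputPath` placeholders are how the component's input parameters and outputs are wired onto the container's command line when a pipeline runs the component.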
- To learn more about building pipelines with the components on AI Hub, see the AI Hub guide to using pipeline components.
- Learn more about building custom pipeline components in the Kubeflow documentation.
Understanding pipelines
The pipelines on AI Hub are portable, scalable end-to-end ML workflows, based on containers. You can reuse the pipelines shared on AI Hub in your AI system, or you can build a custom pipeline to meet your system's requirements.
A Kubeflow pipeline is composed of a set of input parameters and a set of tasks. You can modify a pipeline's input parameters within the Kubeflow Pipelines user interface to:
- Experiment with different sets of hyperparameters, or
- Reuse a pipeline's workflow to train a new model.
A task is an instance of a component that performs a step in the pipeline's workflow. Since tasks are instances of components, each task has input parameters, outputs, and a container image. A task's input parameters can be set from the pipeline's input parameters or from the outputs of other tasks within the pipeline. The Kubeflow Pipelines SDK uses these dependencies to define the pipeline's workflow as a graph.
For example, consider a pipeline with the following tasks:
- Preprocess: This task prepares the training data.
- Train: This task uses the preprocessed training data to train the model.
- Predict: This task deploys the trained model as an ML service and gets predictions for the testing dataset.
- Confusion matrix: This task uses the output of the prediction task to build a confusion matrix.
- ROC: This task uses the output of the prediction task to perform ROC analysis.
To create the workflow graph, the Kubeflow Pipelines SDK analyzes the task dependencies:
- The preprocessing task does not depend on any other tasks, so it can be the first task in the workflow or it can run concurrently with other tasks.
- The training task relies on data produced by the preprocessing task, so training must occur after preprocessing.
- The prediction task relies on the trained model produced by the training task, so prediction must occur after training.
- Building the confusion matrix and performing ROC analysis both rely on the output of the prediction task, so they must occur after prediction is complete. Since neither task depends on the other, they can run concurrently.
Based on this analysis, the Kubeflow Pipelines system will run the preprocessing, training, and prediction tasks sequentially, and then run the confusion matrix and ROC tasks concurrently.
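As a sketch, the five-task example above might look like the following with the Kubeflow Pipelines SDK v1. The component URLs and parameter names are hypothetical; the point is that passing one task's output as another task's input is what creates the edges of the workflow graph.

```python
import kfp
from kfp import dsl
import kfp.components as comp

# Hypothetical component specifications; on AI Hub each would point at a
# published component.yaml.
preprocess_op = comp.load_component_from_url('https://example.com/preprocess/component.yaml')
train_op = comp.load_component_from_url('https://example.com/train/component.yaml')
predict_op = comp.load_component_from_url('https://example.com/predict/component.yaml')
confusion_matrix_op = comp.load_component_from_url('https://example.com/confusion_matrix/component.yaml')
roc_op = comp.load_component_from_url('https://example.com/roc/component.yaml')

@dsl.pipeline(
    name='Example training pipeline',
    description='Preprocess, train, predict, then evaluate concurrently.')
def training_pipeline(raw_data_path: str, learning_rate: float = 0.01):
    # Task inputs set from the pipeline's input parameters.
    preprocess = preprocess_op(input_path=raw_data_path)

    # Task inputs set from another task's outputs; these references are the
    # dependencies the SDK uses to build the workflow graph.
    train = train_op(training_data=preprocess.outputs['output_path'],
                     learning_rate=learning_rate)
    predict = predict_op(model=train.outputs['model_path'])

    # Both evaluation tasks depend only on the prediction task, so the
    # Kubeflow Pipelines system can run them concurrently.
    confusion_matrix_op(predictions=predict.outputs['predictions'])
    roc_op(predictions=predict.outputs['predictions'])

if __name__ == '__main__':
    # Compile the pipeline into a package that can be uploaded to the
    # Kubeflow Pipelines user interface.
    kfp.compiler.Compiler().compile(training_pipeline, 'training_pipeline.yaml')
```

Running this script produces a pipeline package that you can upload through the Kubeflow Pipelines user interface, where input parameters such as `learning_rate` can be changed for each run to experiment with different hyperparameters or to train a new model.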
- Learn how to deploy a pipeline from AI Hub.
- Learn more about building Kubeflow pipelines in the Kubeflow Pipelines documentation.
About component container images
The definition for a pipeline component includes the location of a container image in a registry. Container images are packages that include the component's executable code and a definition of the environment that the code runs in. The container image registry stores, delivers, and controls access to container images.
Publicly shared pipelines and components on AI Hub store their container images in public registries. Anyone can reuse the public pipelines and components that are available on AI Hub.
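As a minimal sketch, assuming the Kubeflow Pipelines SDK v1 and a hypothetical component URL, reusing a publicly shared component requires no registry credentials, because its container image is pulled from a public registry when the pipeline runs:

```python
import kfp.components as comp

# Load a publicly shared component specification (hypothetical URL). The
# spec's implementation.container.image field names an image in a public
# registry, so no extra credentials are needed to pull it at run time.
preprocess_op = comp.load_component_from_url(
    'https://example.com/components/preprocess/component.yaml')

# The loaded factory keeps the parsed specification, including the image
# location that the registry serves.
print(preprocess_op.component_spec.implementation.container.image)
```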
Pipelines and components that are privately shared with you may rely on container images that have additional authentication and authorization requirements. Read the asset details page on AI Hub to learn more about the access requirements for those pipelines and components.
Learn more about setting up a shared Container Registry to make it easier to share pipelines and components within your organization.
What's next
- Learn how to deploy pipelines or reuse components from AI Hub.
- Learn how to share pipelines and components within your organization on AI Hub.