Dataflow ML

Dataflow ML lets you use Dataflow to deploy and manage complete machine learning (ML) pipelines. Use ML models to do local and remote inference with batch and streaming pipelines. Use data processing tools to prepare your data for model training and to process the results of the models.

About Dataflow ML

Prediction and inference

Whether you want to classify images in real-time, run remote inference calls, or build a custom model handler, you can find complete Dataflow ML examples.

See use cases

Data processing

Use the MLTransform class to preprocess data for machine learning (ML) workflows. By combining multiple data processing transforms in one class, MLTransform streamlines the process of applying Apache Beam ML data processing transforms to your workflow.

Learn more

with pipeline as p:
  predictions = (
      p
      | beam.ReadFromSource('a_source')
      | RunInference(MODEL_HANDLER))

RunInference transform

Using RunInference is as straightforward as adding the transform code to your pipeline. In this example, MODEL_HANDLER is the model configuration object.

with beam.Pipeline() as p:
  transformed_data = (
    p
    | beam.Create(data)
    | MLTransform(...)
    | beam.Map(print))

MLTransform code

To prepare your data for training ML models, use MLTransform in your pipeline. MLTransform wraps multiple data processing transforms in one class, letting you use one class for a variety of preprocessing tasks.

Prediction and inference with pre-trained models

I have a Pytorch model

Use a pre-trained model with Pytorch.

I have a scikit-learn model

Use a pre-trained model with scikit-learn.

I have a TensorFlow model

Use a pre-trained model with TensorFlow.

I have a Vertex AI model handler

Apache Beam has built-in support for sending requests to a remotely deployed Vertex AI endpoint. This notebook shows how to use the Apache Beam RunInference transform for image classification with Vertex AI.

I want to use multiple differently-trained models in my pipeline

Use the RunInference transform with a keyed model handler to use multiple models in the same RunInference transform.

Data processing with MLTransform

I want to generate text embeddings for my LLM by using Vertex AI

Use Apache Beam's MLTransform class with the Vertex AI text-embeddings API to generate text embeddings. Text embeddings are a way to represent text as numerical vectors, which is necessary for many natural language processing (NLP) tasks.

I want to generate text embeddings for my LLM by using Hugging Face

Use Apache Beam's MLTransform class with Hugging Face Hub models to generate text embeddings. Hugging Face's SentenceTransformers framework uses Python to generate sentence, text, and image embeddings.

I want to compute a vocabulary from a dataset

Compute a unique vocabulary from a dataset and then map each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning tasks.

I want to scale my data to train my ML model

Scale your data so that you can use it to train your ML model. Apache Beam's MLTransform class includes multiple data scaling transforms.

Prediction and inference with hub models

I want to do sentiment analysis and summarization

You can use Gemma models in your inference pipelines to gauge the sentiment of a conversation, summarize that conversation's content, and draft a reply for a difficult conversation.

I have a trained model from Hugging Face

Use the RunInference transform with a trained model from Hugging Face.

I have a trained model from TensorFlow Hub

Use the RunInference transform for TensorFlow with a trained model from TensorFlow Hub.

I want to do generative AI

Use the RunInference transform for generative AI tasks. This notebook uses a language model from the Hugging Face Model Hub.

ML workflow orchestration

I want to use Dataflow with Vertex AI Pipelines

Vertex AI Pipelines helps you to automate, monitor, and govern your ML systems by orchestrating your ML workflows in a serverless manner. Use Vertex AI Pipelines to orchestrate workflow DAGs defined by either TFX or KFP and to automatically track your ML artifacts using Vertex ML Metadata.

I want to use Dataflow with TFX

TensorFlow Extended (TFX) lets you deploy complete ML pipelines by using an orchestration framework that has a built-in integration with Apache Beam and the Dataflow runner.

I want to use Dataflow with KFP

Kubeflow makes deployments of ML workflows on Kubernetes simple, portable, and scalable. Kubeflow Pipelines are reusable complete ML workflows built using the Kubeflow Pipelines SDK.

Additional features

Use GPUs

Using GPUs in Dataflow jobs can accelerate image processing and machine learning processing tasks. See GPU types supported by Dataflow and recommendations for which type of GPU to use for different workloads.

Learn about using GPUs

Mix and match CPUs and GPUs with right fitting

Mix and match GPUs and CPUs for high performance and lower cost. The ability to target resources to specific pipeline steps provides additional pipeline flexibility and capability, and potential cost savings.

Enable right fitting

Enrich streaming pipelines with feature store data

Apache Beam simplifies the data enrichment workflow by providing a turnkey enrichment transform that you can add to your pipeline.