Dataflow ML
Dataflow ML lets you use Dataflow to deploy and manage complete machine learning (ML) pipelines. Use ML models to do local and remote inference with batch and streaming pipelines. Use data processing tools to prepare your data for model training and to process the results of the models.
Prediction and inference
Whether you want to classify images in real time, run remote inference calls, or build a custom model handler, you can find complete Dataflow ML examples.
Data processing
Use the MLTransform class to preprocess data for machine learning (ML) workflows. By combining multiple data processing transforms in one class, MLTransform streamlines the process of applying Apache Beam ML data processing transforms to your workflow.
RunInference transform
Using RunInference is as straightforward as adding the transform code to your pipeline. In this example, MODEL_HANDLER is the model configuration object.

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference

with pipeline as p:
    predictions = (
        p
        | beam.ReadFromSource('a_source')  # placeholder source
        | RunInference(MODEL_HANDLER))
MLTransform code
To prepare your data for training ML models, use MLTransform in your pipeline. MLTransform wraps multiple data processing transforms in one class, letting you use one class for a variety of preprocessing tasks.

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform

with beam.Pipeline() as p:
    transformed_data = (
        p
        | beam.Create(data)  # data: an in-memory list of examples
        | MLTransform(...)   # configure with an artifact location and transforms
        | beam.Map(print))
Prediction and inference with pre-trained models
I have a PyTorch model
Use a pre-trained model with PyTorch. A minimal sketch follows these examples.
I have a scikit-learn model
Use a pre-trained model with scikit-learn.
I have a TensorFlow model
Use a pre-trained model with TensorFlow.
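All three quickstarts follow the same RunInference pattern. The following sketch assumes a PyTorch checkpoint; the model class, checkpoint path, and inputs are illustrative, and for scikit-learn or TensorFlow you would swap in SklearnModelHandlerNumpy or TFModelHandlerTensor.

import apache_beam as beam
import torch
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor

# Hypothetical model class; replace with the class your checkpoint was trained on.
class LinearRegression(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model_handler = PytorchModelHandlerTensor(
    state_dict_path='gs://your-bucket/model_state_dict.pth',  # assumed path
    model_class=LinearRegression,
    model_params={})

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([torch.Tensor([1.0]), torch.Tensor([2.0])])
        | RunInference(model_handler)
        | beam.Map(print))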
I have a Vertex AI model handler
Apache Beam has built-in support for sending requests to a remotely deployed Vertex AI endpoint. This notebook shows how to use the Apache Beam RunInference transform for image classification with Vertex AI.
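A sketch of that pattern, assuming a model already deployed to an endpoint (the endpoint ID, project, and region below are placeholders):

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vertex_ai_inference import VertexAIModelHandlerJSON

# Placeholder endpoint details; use your own deployed Vertex AI endpoint.
model_handler = VertexAIModelHandlerJSON(
    endpoint_id='1234567890',
    project='your-project',
    location='us-central1')

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['input passed to the endpoint'])
        | RunInference(model_handler)
        | beam.Map(print))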
I want to use multiple differently trained models in my pipeline
Use the RunInference transform with a keyed model handler to use multiple models in the same RunInference transform.
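A minimal sketch, assuming two model handlers configured elsewhere (handler_a, handler_b, and the example inputs are hypothetical):

import apache_beam as beam
from apache_beam.ml.inference.base import (
    KeyedModelHandler, KeyModelMapping, RunInference)

# handler_a and handler_b are hypothetical per-model handlers you configure elsewhere.
keyed_handler = KeyedModelHandler([
    KeyModelMapping(['model_a'], handler_a),
    KeyModelMapping(['model_b'], handler_b),
])

with beam.Pipeline() as p:
    _ = (
        p
        # Key each element by the model that should score it.
        | beam.Create([('model_a', example_1), ('model_b', example_2)])
        | RunInference(keyed_handler))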
Data processing with MLTransform
I want to generate text embeddings for my LLM by using Vertex AI
Use Apache Beam's MLTransform class with the Vertex AI text-embeddings API to generate text embeddings. Text embeddings are a way to represent text as numerical vectors, which is necessary for many natural language processing (NLP) tasks. The sketch after the next example shows the shared MLTransform pattern.
I want to generate text embeddings for my LLM by using Hugging Face
Use Apache Beam's MLTransform class with Hugging Face Hub models to generate text embeddings. Hugging Face's SentenceTransformers framework uses Python to generate sentence, text, and image embeddings.
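Both embedding workflows share the same MLTransform pattern. This sketch assumes the SentenceTransformers checkpoint all-MiniLM-L6-v2; swapping in VertexAITextEmbeddings from apache_beam.ml.transforms.embeddings.vertex_ai gives the Vertex AI variant.

import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.embeddings.huggingface import SentenceTransformerEmbeddings

# Assumed model name; any SentenceTransformers checkpoint works.
embedding_transform = SentenceTransformerEmbeddings(
    model_name='all-MiniLM-L6-v2', columns=['text'])

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{'text': 'Dataflow ML pipelines'}])
        | MLTransform(write_artifact_location=tempfile.mkdtemp())
            .with_transform(embedding_transform)
        | beam.Map(print))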
I want to compute a vocabulary from a dataset
Compute a unique vocabulary from a dataset and then map each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning tasks. A combined sketch follows the next example.
I want to scale my data to train my ML model
Scale your data so that you can use it to train your ML model. Apache Beam's MLTransform class includes multiple data scaling transforms.
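Both of the preceding tasks use the same MLTransform pattern. This sketch chains a vocabulary computation with 0-1 scaling; the column names and data are illustrative.

import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary, ScaleTo01

data = [
    {'words': ['yes', 'no', 'yes'], 'value': [1.0]},
    {'words': ['maybe'], 'value': [3.0]},
]

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(data)
        | MLTransform(write_artifact_location=tempfile.mkdtemp())
            .with_transform(ComputeAndApplyVocabulary(columns=['words']))
            .with_transform(ScaleTo01(columns=['value']))
        | beam.Map(print))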
Prediction and inference with hub models
I want to do sentiment analysis and summarization
You can use Gemma models in your inference pipelines to gauge the sentiment of a conversation, summarize that conversation's content, and draft a reply for a difficult conversation.
I have a trained model from Hugging Face
Use the RunInference transform with a trained model from Hugging Face.
I have a trained model from TensorFlow Hub
Use the RunInference transform for TensorFlow with a trained model from TensorFlow Hub.
I want to do generative AI
Use the RunInference transform for generative AI tasks. This notebook uses a language model from the Hugging Face Model Hub.
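A sketch of that pattern, assuming the google/flan-t5-small checkpoint (the model choice and prompt are illustrative):

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.huggingface_inference import HuggingFacePipelineModelHandler

# Assumed checkpoint; any text2text model from the Hugging Face Model Hub works.
model_handler = HuggingFacePipelineModelHandler(
    task='text2text-generation',
    model='google/flan-t5-small')

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['Translate to German: How are you?'])
        | RunInference(model_handler)
        | beam.Map(print))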
ML workflow orchestration
I want to use Dataflow with Vertex AI Pipelines
Vertex AI Pipelines helps you to automate, monitor, and govern your ML systems by orchestrating your ML workflows in a serverless manner. Use Vertex AI Pipelines to orchestrate workflow DAGs defined by either TFX or KFP and to automatically track your ML artifacts using Vertex ML Metadata.
I want to use Dataflow with TFX
TensorFlow Extended (TFX) lets you deploy complete ML pipelines by using an orchestration framework that has a built-in integration with Apache Beam and the Dataflow runner.
I want to use Dataflow with KFP
Kubeflow makes deployments of ML workflows on Kubernetes simple, portable, and scalable. Kubeflow Pipelines are reusable, complete ML workflows built with the Kubeflow Pipelines SDK.
Additional features
Use GPUs
Using GPUs in Dataflow jobs can accelerate image processing and machine learning tasks. See the GPU types that Dataflow supports and recommendations for which type of GPU to use for different workloads.
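For example, you can request GPU workers for a whole job through the worker_accelerator service option. A sketch of the pipeline options, where the project, region, and bucket are placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket; the worker_accelerator value
# requests one NVIDIA T4 GPU per worker and installs the driver.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=your-project',
    '--region=us-central1',
    '--temp_location=gs://your-bucket/tmp',
    '--dataflow_service_options='
    'worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver',
])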
Mix and match CPUs and GPUs with right fitting
Mix and match GPUs and CPUs for high performance and lower cost. The ability to target resources to specific pipeline steps provides additional pipeline flexibility and capability, and potential cost savings.
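With right fitting, Apache Beam resource hints let you target an accelerator at a single step. In this sketch, which reuses the placeholder source and MODEL_HANDLER from the earlier example, only the inference step requests GPU workers:

# Request a GPU for the inference step only; other steps stay on CPU workers.
gpu_hint = 'type:nvidia-tesla-t4;count:1;install-nvidia-driver'

with pipeline as p:
    predictions = (
        p
        | beam.ReadFromSource('a_source')  # placeholder source
        | RunInference(MODEL_HANDLER).with_resource_hints(accelerator=gpu_hint))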
Enrich streaming pipelines with feature store data
Apache Beam simplifies the data enrichment workflow by providing a turnkey enrichment transform that you can add to your pipeline.
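A sketch of the enrichment transform with the Vertex AI Feature Store handler, where every resource name is a placeholder:

import apache_beam as beam
from apache_beam.transforms.enrichment import Enrichment
from apache_beam.transforms.enrichment_handlers.vertex_ai_feature_store import (
    VertexAIFeatureStoreEnrichmentHandler)

# All names below are placeholders for your own feature store resources.
handler = VertexAIFeatureStoreEnrichmentHandler(
    project='your-project',
    location='us-central1',
    api_endpoint='us-central1-aiplatform.googleapis.com',
    feature_store_name='your_feature_store',
    feature_view_name='your_feature_view',
    row_key='user_id')

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([beam.Row(user_id='123')])
        | Enrichment(handler))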
Model maintenance and evaluation
Automatic model refresh
RunInference lets you perform automatic model updates without stopping your Apache Beam pipeline. Use side inputs to update your model in real time, even while the pipeline is running.
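A sketch of the side-input mechanism using WatchFilePattern, which emits model metadata whenever a new file matches a pattern; the pattern and interval are assumptions, and MODEL_HANDLER is the placeholder from the earlier example.

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.utils import WatchFilePattern

with beam.Pipeline() as p:
    # Emits ModelMetadata for the latest file that matches the pattern.
    model_updates = p | WatchFilePattern(
        file_pattern='gs://your-bucket/models/*.h5',  # assumed location
        interval=600)

    _ = (
        p
        | beam.ReadFromSource('a_source')  # placeholder source
        | RunInference(MODEL_HANDLER, model_metadata_pcoll=model_updates))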
Evaluate your models
Use TensorFlow Model Analysis (TFMA) to investigate and visualize the performance of a model by creating and comparing two models. With Apache Beam, you can evaluate and compare multiple models in one step.
Resources
Run cross-language pipelines
To use RunInference with a Java pipeline, create a cross-language Python transform. The pipeline calls the transform, which does the preprocessing, postprocessing, and inference.
Dataflow permissions
To run the Dataflow ML examples, you might need to configure your Google Cloud permissions. Read a detailed guide about the required permissions for Dataflow pipelines.
View the examples on GitHub
The examples and the corresponding source code are available on GitHub. In GitHub, you can also find instructions for running the examples in Colab.