Explore Dataflow ML notebooks to integrate machine learning into your Apache Beam pipelines. These notebooks provide practical examples and guidance for common machine learning workflows.
Use these resources to:
- Preprocess data for ML models: Perform tasks like scaling data, computing vocabularies, and using MLTransform for data preparation.
- Run inference with various models and frameworks: Use the RunInference transform with PyTorch, TensorFlow, scikit-learn, Hugging Face models, Gemma models, and Vertex AI, including on GPUs with vLLM.
- Generate and manage embeddings: Create text embeddings by using Vertex AI or Hugging Face, and ingest them into databases like AlloyDB and BigQuery for vector search.
- Implement advanced ML pipeline patterns: Automatically refresh models in running pipelines, use multiple models, build ensemble models, and enrich data by using BigQuery, Bigtable, and Vertex AI Feature Store.
- Apply ML to specific use cases: Examples include anomaly detection, as well as sentiment analysis and summarization with Gemma.
All tutorials
| Category | Notebook | Link |
| --- | --- | --- |
| Core Dataflow & MLTransform Concepts | **Preprocessing with the Apache Beam DataFrames API**: Demonstrates the use of the Apache Beam DataFrames API to perform common data exploration and preprocessing steps. | View Notebook |
| Core Dataflow & MLTransform Concepts | **Preprocess data with MLTransform**: A basic introduction to using MLTransform to preprocess data for machine learning workflows. | View Notebook |
| Data enrichment & Embedding | **Vector Embedding Ingestion with Apache Beam and AlloyDB**: Demonstrates how to generate embeddings from data and ingest them into AlloyDB by using Apache Beam and Dataflow for scalable data processing. | View Notebook |
| Data enrichment & Embedding | **Use Apache Beam and BigQuery to enrich data**: Shows how to enrich data by using the Apache Beam enrichment transform with BigQuery. | View Notebook |
| Data enrichment & Embedding | **Embedding Ingestion and Vector Search with Apache Beam and BigQuery**: Demonstrates how to use the Apache Beam RAG package to generate embeddings, ingest them into BigQuery, and perform vector similarity search. | View Notebook |
| Data enrichment & Embedding | **Use Apache Beam and Bigtable to enrich data**: Shows how to enrich data by using the Apache Beam enrichment transform with Bigtable. | View Notebook |
| Data enrichment & Embedding | **Generate text embeddings by using Hugging Face Hub models**: Uses MLTransform to generate embeddings from text data by using Hugging Face's SentenceTransformers framework. | View Notebook |
| Data enrichment & Embedding | **Use Apache Beam and Vertex AI Feature Store to enrich data**: Shows how to enrich data by using the Apache Beam enrichment transform with Vertex AI Feature Store. | View Notebook |
| Data enrichment & Embedding | **Generate text embeddings by using the Vertex AI API**: Uses the Vertex AI text-embeddings API to generate text embeddings with Google's large generative AI models. | View Notebook |
| Model training & Data processing | **Update ML models in running pipelines**: Demonstrates how to perform automatic model updates without stopping your Apache Beam pipeline by using side inputs. | View Notebook |
| Model training & Data processing | **Compute and apply vocabulary on a dataset**: Shows how to use MLTransform to generate a vocabulary from input text and assign an index value to each token. | View Notebook |
| Model training & Data processing | **Run ML inference with multiple differently-trained models**: Demonstrates how to use a KeyedModelHandler to run inference in an Apache Beam pipeline with multiple different models on a per-key basis. | View Notebook |
| Model training & Data processing | **Use MLTransform to scale data**: Shows how to use MLTransform to scale data, an important preprocessing step for training machine learning (ML) models. | View Notebook |
| Model training & Data processing | **TensorFlow Model Analysis in Beam**: Shows how to use TFMA to investigate and visualize the performance of a model as part of your Apache Beam pipeline by creating and comparing two models. | View Notebook |
| Run inference | **Remote inference in Apache Beam**: Demonstrates how to implement a custom inference call in Apache Beam by using the Google Cloud Vision API. | View Notebook |
| Run inference | **Bring your own ML model to Beam RunInference**: Illustrates how to use the spaCy package to load an ML model and perform inference in an Apache Beam pipeline by using the RunInference PTransform. | View Notebook |
| Run inference | **Run inference with a Gemma open model**: Demonstrates how to load the preconfigured Gemma 2B model and then use it in an Apache Beam inference pipeline. | View Notebook |
| Run inference | **Use RunInference for Generative AI**: Shows how to use the Apache Beam RunInference transform for generative AI tasks with a large language model (LLM) from the Hugging Face Model Hub. | View Notebook |
| Run inference | **Apache Beam RunInference with Hugging Face**: Shows how to use Hugging Face models and pipelines in Apache Beam pipelines that use the RunInference transform. | View Notebook |
| Run inference | **Ensemble model using an image captioning and ranking example**: Shows how to implement a cascade model in Apache Beam by using the RunInference API for image captioning. | View Notebook |
| Run inference | **Apache Beam RunInference for PyTorch**: Demonstrates the use of the RunInference transform for PyTorch. | View Notebook |
| Run inference | **Use RunInference in Apache Beam**: Demonstrates how to use the RunInference API with three popular ML frameworks: PyTorch, TensorFlow, and scikit-learn. | View Notebook |
| Run inference | **Apache Beam RunInference for scikit-learn**: Demonstrates the use of the RunInference transform for scikit-learn. | View Notebook |
| Run inference | **Apache Beam RunInference with TensorFlow**: Shows how to use the Apache Beam RunInference transform for TensorFlow. | View Notebook |
| Run inference | **Use RunInference with TFX Basic Shared Libraries**: Demonstrates how to use the Apache Beam RunInference transform with TensorFlow and TFX Basic Shared Libraries (tfx-bsl). | View Notebook |
| Run inference | **Apache Beam RunInference with TensorFlow and TensorFlow Hub**: Shows how to use the Apache Beam RunInference transform for TensorFlow with a trained model from TensorFlow Hub. | View Notebook |
| Run inference | **Apache Beam RunInference with Vertex AI**: Shows how to use the Apache Beam RunInference transform for image classification with Vertex AI. | View Notebook |
| Run inference | **Run ML inference by using vLLM on GPUs**: Demonstrates how to run machine learning inference by using vLLM and GPUs. | View Notebook |
| Run inference | **Use TPUs in Dataflow**: Demonstrates how to configure and run two distinct Dataflow pipelines that use Tensor Processing Units (TPUs). The first pipeline performs a simple computation to confirm TPU access; the second, more complex pipeline runs inference with the Gemma-3-27b-it model. | View Notebook |
| Specialized use cases | **Anomaly Detection on Batch and Streaming Data using Apache Beam (Z-Score Method)**: Shows how to perform anomaly detection on both batch and streaming data by using the AnomalyDetection PTransform with the Z-Score algorithm. | View Notebook |
| Specialized use cases | **Use Gemma to gauge sentiment and summarize conversations**: Demonstrates how to use Gemma to gauge the sentiment of a conversation, summarize the conversation's content, and draft a reply. | View Notebook |