About Dataflow ML

You can use Dataflow ML's ability to process data at scale for prediction and inference pipelines and for data preparation for training.

Figure 1. The complete Dataflow ML workflow.

Requirements and limitations

  • Dataflow ML supports batch and streaming pipelines.
  • The RunInference API is supported in Apache Beam 2.40.0 and later versions.
  • The MLTransform API is supported in Apache Beam 2.53.0 and later versions.
  • Model handlers are available for PyTorch, scikit-learn, TensorFlow, ONNX, and TensorRT. For unsupported frameworks, you can use a custom model handler.

Data preparation for training

Use the MLTransform API to preprocess your data for training machine learning models. MLTransform wraps common preprocessing transforms in one class, so you can apply the same preprocessing steps for training and for inference.

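The following sketch shows MLTransform computing and applying a vocabulary over a text column. It assumes the `tensorflow_transform` dependency is installed; the column name and input data are illustrative.

```python
import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary

# Artifacts (such as the computed vocabulary) are written here so the
# same transform can be reused at inference time.
artifact_location = tempfile.mkdtemp()

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create([
            {'feature': ['dataflow', 'ml']},
            {'feature': ['beam', 'runinference']},
        ])
        | MLTransform(write_artifact_location=artifact_location).with_transform(
            ComputeAndApplyVocabulary(columns=['feature']))
        | beam.Map(print))
```
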
Prediction and inference pipelines

Dataflow ML combines the power of Dataflow with Apache Beam's RunInference API. With the RunInference API, you define the model's characteristics and properties and pass that configuration to the RunInference transform. This feature lets you run the model within your Dataflow pipeline without needing to know the model's implementation details. You can choose the framework that best suits your data, such as TensorFlow or PyTorch.
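
For example, here is a minimal sketch of RunInference with a scikit-learn model handler. The Cloud Storage path to the pickled model and the input values are placeholders; replace them with your own.

```python
import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import (
    ModelFileType,
    SklearnModelHandlerNumpy,
)

# 'gs://your-bucket/model.pkl' is a hypothetical path to a pickled
# scikit-learn model; the handler loads and manages it for you.
model_handler = SklearnModelHandlerNumpy(
    model_uri='gs://your-bucket/model.pkl',
    model_file_type=ModelFileType.PICKLE)

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
        | RunInference(model_handler)  # emits PredictionResult elements
        | beam.Map(print))
```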

Run multiple models in a pipeline

Use the RunInference transform to add multiple inference models to your Dataflow pipeline. For more information, including code details, see Multi-model pipelines in the Apache Beam documentation.
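
One common pattern is A/B branching, where the same input PCollection feeds two models. The sketch below assumes two pickled scikit-learn models at hypothetical Cloud Storage paths:

```python
import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import (
    ModelFileType,
    SklearnModelHandlerNumpy,
)

# Hypothetical paths to two different pickled scikit-learn models.
model_handler_a = SklearnModelHandlerNumpy(
    model_uri='gs://your-bucket/model_a.pkl',
    model_file_type=ModelFileType.PICKLE)
model_handler_b = SklearnModelHandlerNumpy(
    model_uri='gs://your-bucket/model_b.pkl',
    model_file_type=ModelFileType.PICKLE)

with beam.Pipeline() as pipeline:
    examples = pipeline | beam.Create([np.array([1.0, 2.0])])
    # A/B pattern: route the same input PCollection to both models.
    predictions_a = examples | 'RunModelA' >> RunInference(model_handler_a)
    predictions_b = examples | 'RunModelB' >> RunInference(model_handler_b)
```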

Use GPUs with Dataflow

For batch or streaming pipelines that require the use of accelerators, you can run Dataflow pipelines on NVIDIA GPU devices. For more information, see Run a Dataflow pipeline with GPUs.
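
As a rough sketch, GPU workers are requested through the worker_accelerator service option in your pipeline options. The project, bucket, container image, GPU type, and count below are all placeholder values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    flags=[],
    runner='DataflowRunner',
    project='your-project',                # placeholder project ID
    region='us-central1',
    temp_location='gs://your-bucket/tmp',  # placeholder bucket
    dataflow_service_options=[
        'worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver'
    ],
    # Custom container that includes the GPU libraries your model needs.
    sdk_container_image='gcr.io/your-project/beam-gpu:latest')
```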

Troubleshoot Dataflow ML

This section provides troubleshooting strategies and links that you might find helpful when using Dataflow ML.

Stack expects each tensor to be equal size

If you provide images of different sizes or word embeddings of different lengths when using the RunInference API, the following error might occur:

File "/beam/sdks/python/apache_beam/ml/inference/pytorch_inference.py", line 232, in run_inference batched_tensors = torch.stack(key_to_tensor_list[key]) RuntimeError: stack expects each tensor to be equal size, but got [12] at entry 0 and [10] at entry 1 [while running 'PyTorchRunInference/ParDo(_RunInferenceDoFn)']

This error occurs because the RunInference API can't batch tensor elements of different sizes. For workarounds, see Unable to batch tensor elements in the Apache Beam documentation.
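
For example, one workaround is to disable dynamic batching by overriding batch_elements_kwargs on your model handler so that each batch contains a single element. The model class, weights path, and handler subclass below are illustrative sketches, not fixed names:

```python
import torch
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor


class LinearModel(torch.nn.Module):
    """Placeholder model used only to illustrate the handler setup."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)


class NoBatchModelHandler(PytorchModelHandlerTensor):
    """Limits each batch to one element so torch.stack never mixes sizes."""

    def batch_elements_kwargs(self):
        return {'max_batch_size': 1}


# 'gs://your-bucket/model.pt' is a hypothetical path to saved weights.
model_handler = NoBatchModelHandler(
    state_dict_path='gs://your-bucket/model.pt',
    model_class=LinearModel,
    model_params={})
```

Note that a batch size of 1 trades away batching throughput, so prefer padding or resizing your inputs to a uniform size when that is practical.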

What's next