
The next generation of Dataflow: Dataflow Prime, Dataflow Go, and Dataflow ML

July 20, 2022
Sachin Agarwal

Group Product Manager, Google Cloud

Frank Guan

Product Marketing Lead, Google Cloud

According to IDC, 75% of enterprises will shift from piloting to operationalizing artificial intelligence by the end of 2024. Yet the growing complexity of data types, heterogeneous data stacks, and programming languages makes this a challenge for data engineers. In the current economic climate, doing more at lower cost and with higher efficiency has also become a key consideration for many organizations.

Today, we are pleased to announce three major releases that bring the power of Google Cloud’s Dataflow to more developers, for more use cases and larger data processing workloads, while keeping costs low. They are part of our goal to democratize the power of big data, real-time streaming, and ML/AI for all developers, everywhere.

The three big Dataflow releases we’re thrilled to announce as generally available are:

  • Dataflow Prime - Dataflow Prime takes the serverless, no-operations benefits of Dataflow to a new level. It automatically applies both horizontal autoscaling (more machines) and vertical autoscaling (larger machines with more memory) to your streaming data processing workloads, with batch support coming in the near future. With Dataflow Prime, pipelines are more efficient, enabling you to apply insights in real time.

  • Dataflow Go - Dataflow Go provides native support for Go, a rapidly growing programming language thanks to its flexibility, ease of use, and differentiated concepts, for both batch and streaming data processing workloads. With Apache Beam’s unique multi-language model, Dataflow Go pipelines can use the wide range of well-adopted, high-performance Java I/O connectors, with ML transforms and I/O connectors from Python coming soon.

  • Dataflow ML - Speaking of ML transforms, Dataflow now has out-of-the-box support for running PyTorch and scikit-learn models directly within the pipeline. The new RunInference transform lets models be used in production pipelines with very little code. These features are in addition to Dataflow's existing ML capabilities, such as GPU support and the pre- and post-processing system for ML training, either directly or via frameworks such as TensorFlow Extended (TFX).
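
To picture what an inference transform like RunInference does for you, here is a toy sketch in plain Python. It does not use Apache Beam or the real RunInference API. The names `ToyModelHandler`, `Prediction`, and `run_inference_sketch` are hypothetical, and the linear "model" is a placeholder for real PyTorch or scikit-learn weights. The sketch only illustrates the general shape: load the model once, score examples in batches, and emit each example paired with its prediction.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


@dataclass(frozen=True)
class Prediction:
    """Pairs an input example with the model output (hypothetical type)."""
    example: float
    inference: float


class ToyModelHandler:
    """Hypothetical stand-in for a model handler: it knows how to load a model."""

    def __init__(self) -> None:
        self.load_count = 0  # tracks how many times the model was loaded

    def load_model(self) -> Callable[[Sequence[float]], List[float]]:
        # Assumption: a real handler would deserialize trained weights here.
        self.load_count += 1
        return lambda batch: [2.0 * x + 1.0 for x in batch]


def batched(items: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def run_inference_sketch(
    examples: Iterable[float], handler: ToyModelHandler, batch_size: int = 2
) -> Iterator[Prediction]:
    """Score examples batch by batch, reusing a single loaded model."""
    model = handler.load_model()  # loaded once, reused for every batch
    for batch in batched(examples, batch_size):
        for example, output in zip(batch, model(batch)):
            yield Prediction(example, output)


handler = ToyModelHandler()
results = list(run_inference_sketch([1.0, 2.0, 3.0], handler))
for r in results:
    print(r)
```

In a real pipeline, the transform takes care of this batching and model loading, which is why the pipeline author writes so little code.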

We’re so excited to make Dataflow even better.  With the world’s only truly unified batch and streaming data processing model provided by Apache Beam, the wide support for ML frameworks, and the unique cross-language capabilities of the Beam model, Dataflow is becoming ever easier, faster, and more accessible for all data processing needs.

Getting started

Interested in running a proof of concept using your own data? Talk to your Google Cloud sales contact for hands-on workshop opportunities or sign up here.
