Google Cloud Platform

Cloud Dataflow 2.0 SDK goes GA

Learn what “Beam-first” design means for the new Cloud Dataflow 2.0 SDKs for Java and Python.

Cloud Dataflow is the fully managed data processing service on Google Cloud Platform (GCP), supporting both stream and batch execution of pipelines. Today, the Cloud Dataflow team is pleased to announce the first stable releases of the 2.0 Java and Python SDKs, based on Apache Beam 2.0. These releases represent a major milestone in the development of Beam, as they're built on a "Beam-first" design that emphasizes portability. (In January 2016, Google donated the Cloud Dataflow SDKs to the Apache Software Foundation as part of the Apache Beam project, and in January 2017, Beam became a Top Level Project.)

We're excited that this release brings a number of community-contributed Beam connectors to Cloud Dataflow's Java SDK, including Java Message Service (JMS), Java Database Connectivity (JDBC), MongoDB and Amazon Kinesis. Existing connectors also gain a number of usability improvements, including better handling of large BigQuery sinks, the ability to write streaming data to text or Apache Avro files on Cloud Storage, support for writing to multiple BigQuery tables based on the content of incoming data, and more.
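For example, writing an unbounded stream to text files on Cloud Storage now takes only a windowing step plus TextIO's windowed-write mode. Here's a minimal sketch of that pattern, not taken from the release notes; the project, topic, bucket path and shard count are placeholder assumptions:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class StreamingTextWrite {
  public static void main(String[] args) {
    // Pass runner, project and streaming flags on the command line,
    // e.g. --runner=DataflowRunner --project=my-project --streaming.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadMessages",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
     // Window the unbounded stream so each window can be committed as files.
     .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
     // Windowed writes require an explicit shard count on unbounded input.
     .apply("WriteToGcs", TextIO.write()
         .to("gs://my-bucket/output/events")
         .withWindowedWrites()
         .withNumShards(1));

    p.run();
  }
}
```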

A notable feature of this release is the State API for Java, which provides richer abstractions for interacting with per-key aggregated state. Some key use cases for this feature include:

  • Exact batching of N elements per aggregate (see the sketch after this list)
  • Per-user state machines (e.g., for account verification)
  • Outputting only when a value changes
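
To make the first use case concrete, here's a minimal sketch of exact batching with the Java State API. It's our own illustration rather than code from the post; the BatchByKeyFn name and batch size of 10 are arbitrary choices:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Buffers elements per key and emits them in exact batches of BATCH_SIZE. */
public class BatchByKeyFn extends DoFn<KV<String, String>, List<String>> {
  private static final int BATCH_SIZE = 10;

  // Per-key (and per-window) buffer of pending elements.
  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec =
      StateSpecs.bag(StringUtf8Coder.of());

  // Per-key count of how many elements are currently buffered.
  @StateId("count")
  private final StateSpec<ValueState<Integer>> countSpec =
      StateSpecs.value(VarIntCoder.of());

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("buffer") BagState<String> buffer,
      @StateId("count") ValueState<Integer> count) {
    Integer stored = count.read();
    int n = (stored == null ? 0 : stored) + 1;
    buffer.add(c.element().getValue());

    if (n >= BATCH_SIZE) {
      // Copy the batch out before clearing the state, then reset for this key.
      List<String> batch = new ArrayList<>();
      for (String value : buffer.read()) {
        batch.add(value);
      }
      c.output(batch);
      buffer.clear();
      count.clear();
    } else {
      count.write(n);
    }
  }
}
```

State is scoped per key and window, so this DoFn must be applied to a keyed (KV) PCollection. A production version would likely pair it with Beam's Timer API to flush a trailing partial batch; we've omitted that here for brevity.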
This is also the first release of the Python SDK with API stability, and it includes all existing connectors for BigQuery, Cloud Datastore, and the TensorFlow TFRecord and Avro formats. For a complete list of changes, see the Java 2.0 and Python release notes.

We've gotten great feedback from our customers so far:

"[I]t’s a straightforward affair — swap out the SDKs, fix up some imports and compile errors, and finally update your runners and program arguments. That’s it. Even my manager could do it, seriously."
Shine Technologies

"Reading and writing Cloud Pub/Sub message attributes, dynamically redirecting the output to different BigQuery destinations depending on data content and integrating the local debug runner with streaming data sources are all examples of features that make Cloud Dataflow 2.0 ever so useful to our multi-tenant enterprise applications."
Mingjian Song, Software Architect, JDA Software

Please note that we're not announcing additional deprecations of the 1.x SDKs at this time. We understand that it takes time to upgrade, and we look forward to working closely with the community to ensure that 2.0 is an excellent experience for all of our customers. To try out the new SDKs, please check out the Cloud Dataflow quickstarts.