Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Apache Beam graduates from incubation: Try it today on Google Cloud Dataflow

Tuesday, January 10, 2017

By Frances Perry, Software Engineering Lead at Google and Apache Beam PMC Member

Apache Beam is now a Top Level Project at the Apache Software Foundation, and its future looks brighter than ever — including for Google Cloud Dataflow users

Today, the Apache Software Foundation announced that Apache Beam has successfully graduated from incubation, becoming a Top Level Project following the community-driven development processes of the foundation. Congratulations, Apache Beam!

Apache Beam has its roots in Google Cloud Dataflow, the fully managed service for executing both batch and streaming data processing pipelines that powers mission-critical processes for companies like Spotify, Citi, Qubit and Google itself. Its novel programming model enables users to write unified, efficient and portable data processing pipelines.

Last January, Google and partners from Cloudera, data Artisans, and Talend proposed the creation of a new project to generalize and extend the Dataflow programming model. The resulting Apache Beam project spent the last year in incubation with the Apache Software Foundation, building a vibrant and welcoming open source community, with a number of new contributors joining the original developers. Today’s announcement is a recognition of a sustainable community, with the potential to grow this technology more than any single company could alone.

As announced previously, the Cloud Dataflow SDKs will be based on Apache Beam going forward. This means that Cloud Dataflow users will continue to use the same intuitive programming model for expressing batch and streaming computations and get the same no-knobs, performant Cloud Dataflow runtime that’s tightly integrated with the rest of Google Cloud Platform (GCP). In addition, users will benefit from the portability provided by Apache Beam, which lets them move the same data processing pipeline onto any supported runtime environment, including but not limited to on-premises Apache Spark clusters, Apache Flink running in the cloud, and Cloud Dataflow on GCP.
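To make the portability idea concrete, here is a minimal sketch in plain Python. It is a toy illustration only and does not use the Apache Beam API: the names `PIPELINE`, `LocalRunner`, `ThreadedRunner` and `run_pipeline` are hypothetical, chosen to show how a pipeline can be defined once and then handed to interchangeable execution backends.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a pipeline definition: an ordered list of per-element
# transforms. This is NOT the Apache Beam API, only an illustration.
PIPELINE = [str.strip, str.lower]


def apply_steps(element, steps):
    """Apply every transform in order to a single element."""
    for step in steps:
        element = step(element)
    return element


class LocalRunner:
    """Hypothetical runner: executes the steps sequentially in-process."""

    def run(self, steps, data):
        return [apply_steps(x, steps) for x in data]


class ThreadedRunner:
    """Hypothetical runner: executes the same steps on a thread pool."""

    def run(self, steps, data):
        with ThreadPoolExecutor(max_workers=4) as pool:
            # pool.map preserves the input order of the results.
            return list(pool.map(lambda x: apply_steps(x, steps), data))


def run_pipeline(runner, data):
    """The pipeline definition stays fixed; only the runner changes."""
    return runner.run(PIPELINE, data)


if __name__ == "__main__":
    words = ["  Beam ", "FLINK", "Spark"]
    print(run_pipeline(LocalRunner(), words))
    print(run_pipeline(ThreadedRunner(), words))
```

In this sketch, swapping `LocalRunner` for `ThreadedRunner` changes how the work is executed without touching the pipeline definition, which is the property the paragraph above describes for Beam pipelines and their supported runners.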

On that note, we’re beaming with joy to announce the availability of the first Beam-based Dataflow SDK for Java, version 2.0.0-beta1. Now you can use both the Java and Python SDKs to run Beam-based pipelines with Beta support on the Cloud Dataflow service. Dataflow SDK for Java 1.9.0 remains the recommended version for production use. To get a taste of the Beam-based future, along with the usual new features and improvements, try out 2.0.0-beta1 and run your pipeline on the Cloud Dataflow service or anywhere else you’d like.

We’re excited to be involved in the future of Apache Beam and its ecosystem, while continuing to ensure that Cloud Dataflow is the premier place to run Beam pipelines.
