Google Cloud Big Data and Machine Learning Blog


Cloud Dataflow, Apache Beam and you

Thursday, August 4, 2016
Posted by Rafael Fernández, Technical Program Manager and Frances Perry, Software Engineer

2016 has been a very exciting year for us on the Google Cloud Dataflow team. Six months ago, we announced our intention to donate the Cloud Dataflow programming model and SDKs to the Apache Software Foundation. That resulted in the incubating project Apache Beam, whose community includes folks from Google as well as many new friends. The Beam community has been hard at work refactoring the donated code, integrating new contributions and defining the mechanics of releases. You can catch a glimpse of the progress made in this recent Apache Beam blog post.

Now, as we enter the second half of 2016, we want to share with you what the progress on Apache Beam means for Google Cloud Platform customers. The Cloud Dataflow service will continue to evolve and add unique processing features. We'll continue to invest in areas such as autoscaling, tailored UI experiences such as a graphical monitoring UI, and integrations with Cloud Platform monitoring and management services such as Stackdriver Monitoring. Our goal is to build the premier cloud service on which to run Apache Beam programs. To that end, we'll distribute Apache Beam code as Cloud Dataflow SDKs, one for each supported language.

These ready-to-use distributions will package the portions of Apache Beam that are most useful for executing on Cloud Dataflow. They'll also undergo additional testing and validation for use with the Cloud Dataflow service. As a Cloud Platform customer, you can expect our support channels to fully assist you with any needs you may have with the service and these distributions. The first such Cloud Dataflow distribution debuted last week with the beta of the Cloud Dataflow SDK for Python v0.4.0. Not only is this the first publicly available Cloud Dataflow SDK for Python, but it's also the first one to come directly from Apache Beam, as evidenced by the now ubiquitous "import apache_beam as beam" statement :)

We're hard at work preparing the Beam redistribution for Java, which was our first SDK. We expect to version-bump the SDK for Java to 2.x to coincide with its first redistribution of Beam later this year. Until then, we advise you to use the 1.x releases against the service. The move from 1.x to 2.x will involve simple changes, such as package names changing from "com.google.cloud.dataflow" to "org.apache.beam", and should be straightforward. Expect us to share more details later this year.
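To give a feel for the kind of package rename described above, here is a minimal sketch of a helper that rewrites the old package prefix in Java source text. It assumes the rename is a plain prefix swap; the actual 2.x migration may involve other changes that a text substitution can't capture. The function name `rewrite_packages` and the sample source are hypothetical, chosen only for illustration.

```python
import re

# Assumption: the 1.x -> 2.x rename is a plain prefix swap, as the post
# describes. The real migration may need more than this.
OLD_PREFIX = "com.google.cloud.dataflow"
NEW_PREFIX = "org.apache.beam"

# Match the old prefix only as a whole package-path prefix, so that an
# unrelated package such as "com.google.cloud.dataflowx" is left alone.
_PATTERN = re.compile(r"(?<![\w.])" + re.escape(OLD_PREFIX) + r"(?=[.;\s]|$)")


def rewrite_packages(java_source: str) -> str:
    """Return java_source with the old package prefix replaced by the new one."""
    return _PATTERN.sub(NEW_PREFIX, java_source)


# Illustrative input only; not taken from a real project.
sample = (
    "import com.google.cloud.dataflow.sdk.Pipeline;\n"
    "import com.google.cloud.dataflowx.Other;\n"
    "import java.util.List;\n"
)
migrated = rewrite_packages(sample)
print(migrated)
```

A helper like this is only a starting point: it would be worth reviewing the resulting diff by hand and waiting for the official 2.x migration details before relying on it.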

We're very excited about the joint future of Apache Beam and Cloud Dataflow, and can't wait to finish our transition project. In the end, you'll have the best of both worlds: a robust, community-owned programming model for today's big data processing needs and a strong cloud service on which to run the resulting pipelines.
