
Cloud Dataflow, Apache Beam and you

August 4, 2016
Rafael Fernandez

Technical Program Manager

Frances Perry

Software Engineer, Apache Beam committer



2016 has been a very exciting year for us on the Google Cloud Dataflow team. Six months ago, we announced our intention to donate the Cloud Dataflow programming model and SDKs to the Apache Software Foundation. That donation resulted in the incubating project Apache Beam, whose community includes folks from Google as well as many new friends. The Beam community has been hard at work refactoring the donated code, integrating new contributions and defining the mechanics of releases. You can catch a glimpse of the progress in this recent Apache Beam blog post.

Now, as we enter the second half of 2016, we want to share what this progress on Apache Beam means for Google Cloud Platform customers. The Cloud Dataflow service will continue to evolve and add unique processing features. We'll continue to invest in areas such as autoscaling, tailored experiences like the graphical monitoring UI, and integrations with Cloud Platform monitoring and management services such as Stackdriver Monitoring. Our goal is to build the premier cloud service on which to run Apache Beam programs. To that end, we'll distribute Apache Beam code as Cloud Dataflow SDKs, one for each supported language.

These ready-to-use distributions will package the portions of Apache Beam that are most useful for executing on Cloud Dataflow, and they'll undergo additional testing and validation for use with the Cloud Dataflow service. As a Cloud Platform customer, you can expect our support channels to fully assist you with both the service and these distributions. The first such distribution debuted last week with the beta of the Cloud Dataflow SDK for Python v0.4.0. Not only is this the first publicly available Cloud Dataflow SDK for Python, it's also the first to come directly from Apache Beam, as evidenced by the now-ubiquitous "import apache_beam as beam" statement :)
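
For a taste of what that looks like, here's a minimal word-count sketch in the Python SDK. The pipeline shape is representative rather than definitive: transform and runner names have varied slightly across early SDK releases, and the output path below is a placeholder.

```python
import apache_beam as beam

# Construct a pipeline; with no options it runs on the local direct runner.
# To target the Cloud Dataflow service you would pass runner and project
# options instead.
p = beam.Pipeline()

(p
 | 'Create' >> beam.Create(['to be or not to be'])        # a tiny in-memory source
 | 'Split' >> beam.FlatMap(lambda line: line.split())     # one element per word
 | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
 | 'CountPerWord' >> beam.CombinePerKey(sum)
 | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
 | 'Write' >> beam.io.WriteToText('/tmp/wordcounts'))     # placeholder path

p.run()
```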

We're hard at work preparing the Beam-based distribution for Java, which was our first SDK. We expect to bump the SDK for Java to version 2.x to coincide with its first Beam-based release later this year; until then, you're advised to use the 1.x releases against the service. The transition from 1.x to 2.x will involve simple changes, such as package names moving from "com.google.cloud.dataflow" to "org.apache.beam", and should be straightforward. Expect us to share more details as that work progresses.

We're very excited about the joint future of Apache Beam and Cloud Dataflow, and we can't wait to finish this transition. In the end, you'll have the best of both worlds: a robust, community-owned programming model for today's big data processing needs and a strong cloud service on which to run the resulting pipelines.
