The universe of people and companies that can easily process data at scale is larger today, thanks to support for Python on Google Cloud Dataflow, now open to the public in beta.
Along with beta support comes v0.4.0 of the Cloud Dataflow SDK for Python. This release is a distribution of Apache Beam, an exciting project under the Apache Incubator that evolved from Dataflow as described in our January announcement as well as the May update entitled Why Apache Beam. Beam unifies batch and streaming through portable SDKs for a variety of execution engines. This SDK release supports execution on Cloud Dataflow and locally. It includes connectors to BigQuery and text files in Cloud Storage, and a new source/sink API extends the SDK to any other source or sink. This paves the way for eventual feature parity between the Python and Java SDKs.
We’ve been overwhelmed by the volume and variety of interest in Python support, a top requested feature of Cloud Dataflow, and thrilled by what those in our alpha program have accomplished. Consider Veolia, a French multinational environmental services company. With equipment all over the world generating data, a team of developers used Python on Cloud Dataflow to amass industrial operational data in a data lake based on Google Cloud Storage and Google BigQuery. They run regular Cloud Dataflow pipelines that join tens of thousands of files at a time into structured tables in BigQuery. Now, all questions about factory data are an SQL query away from an answer.
Or look at Namshi, a leading fashion e-commerce retailer based in Dubai and backed by Rocket Internet. Using Python on Cloud Dataflow, Namshi automated complex retail-specific analytics and metrics over raw Google Analytics and operational data sitting in BigQuery. “It strikes a good balance between being ‘Pythonic’ and being consistent with the Beam model,” said Hisham Zarka, Namshi co-founder. “The resulting code is exceptionally clear and concise owing to the strength of the Beam programming model and the convenience of the Cloud Dataflow execution engine.”
Among Cloud Dataflow alpha testers, the most common uses are loading data into, or performing complex analyses on, analytical databases like BigQuery; pre-processing data for machine learning; and scientific or statistical analysis. They chose Cloud Dataflow because it’s a fully managed service that requires no cluster configuration or management, freeing data engineering or data science teams to write and monitor pipelines rather than tend clusters. “I am so excited to no longer baby-sit a cluster,” said one alpha tester. They also love how Pythonic Cloud Dataflow is: data stays in Python data structures for straightforward debugging, and the Cloud Dataflow SDK for Python supports all the libraries and dependencies they’re used to, including NumPy, SciPy and Pandas.
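To make the point about Python data structures concrete, here is a minimal sketch of a step written as an ordinary function over ordinary Python values. The function name `parse_reading` and the 'sensor_id,value' line format are hypothetical, invented for illustration; they do not come from the SDK or from any alpha tester's code.

```python
def parse_reading(line):
    """Parse a hypothetical 'sensor_id,value' text line into a dict."""
    sensor_id, value = line.split(',')
    return {'sensor_id': sensor_id.strip(), 'value': float(value)}


# Because the step is a plain function returning a plain dict, it can be
# checked directly in the interpreter before it is attached to a pipeline
# (for example with beam.Map).
record = parse_reading('pump-7, 3.5')
print(record)
```

Keeping steps this small is a design choice, not a requirement: it simply means a bug can be reproduced with one function call instead of a full pipeline run.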
To try it yourself, first install the SDK:
pip install google-cloud-dataflow --user
Then, once you have the SDK, enter Python's interactive mode (typically by running python at the command line).
Within interactive mode, the next four lines (1) import Beam, (2) instantiate a pipeline, (3) add steps to the pipeline and (4) close the with statement with an empty line.
import apache_beam as beam
with beam.Pipeline() as p:
  p | beam.Create(['hello', 'world']) | beam.io.Write(beam.io.TextFileSink('./test'))

After closing the with statement, the console should print a line indicating the pipeline executed. Let’s (1) exit interactive mode and (2) take a peek at what was written!
exit()
more ./test*
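If you would rather inspect the output from Python than from the shell, the sketch below does roughly what more ./test* does: it collects the lines of every file whose name starts with a given prefix. The helper name `read_output_lines` is hypothetical, and the code assumes only that the pipeline wrote one or more text files beginning with that prefix; it makes no assumption about how the SDK names them beyond that.

```python
import glob
import os


def read_output_lines(prefix):
    """Return every line from the regular files whose names start with prefix."""
    lines = []
    # glob.escape keeps special characters in the prefix from being
    # interpreted as wildcards; only the trailing '*' is a pattern.
    for path in sorted(glob.glob(glob.escape(prefix) + '*')):
        if not os.path.isfile(path):
            continue  # skip directories that happen to match
        with open(path) as f:
            lines.extend(line.rstrip('\n') for line in f)
    return lines
```

Calling read_output_lines('./test') in the directory where you ran the pipeline should return the elements that were written. Treat the order as unspecified and sort the result if you need a stable one.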
Now it’s your turn. Head over to the Python Quickstart to create your own pipeline and learn the arguments for execution on Cloud Dataflow. Then change the world. Tweet me at @ericmander with your accomplishments.