Dataflow Stream Processing now supports Python
By Anand Iyer, Product Manager, and Ahmet Altay, Software Engineer
Data engineers and data scientists increasingly look to implement real-time stream processing jobs to derive insights from data as it streams into their environment. Of course, as these data professionals build real-time pipelines, they want to be able to use the programming languages they’re most familiar with. At Google Cloud, we’ve noticed the rapid growth of the Python programming language, and heard from customers that they want to author stream processing jobs with Python. Today, we’re answering that demand with the public Beta release of stream processing capabilities in the Python SDK for Cloud Dataflow.
The Python SDK is based on Apache Beam. Apache Beam, an open source project from the Apache Software Foundation, provides a unified programming model for authoring batch and streaming workloads that can be executed on a variety of execution engines. Cloud Dataflow is a fully-managed serverless execution engine for Apache Beam jobs. With today’s release, customers can author streaming jobs in Python and run them on Dataflow. Beam’s intuitive and succinct Python APIs, in conjunction with Dataflow’s managed capabilities, maximize the productivity of engineers and reduce the barriers to authoring streaming workloads.
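The "unified programming model" means one transform definition can be applied to a bounded (batch) source or an unbounded (streaming) source. As a toy, stdlib-only illustration of that idea (this is not the Beam API; `word_lengths` and `unbounded_words` are invented for the sketch):

```python
import itertools

def word_lengths(elements):
    # One transform definition...
    return (len(w) for w in elements)

# ...applied to a bounded ("batch") source:
batch = ["beam", "unifies", "models"]
print(list(word_lengths(batch)))  # [4, 7, 6]

# ...and to an unbounded ("streaming") source, consumed incrementally:
def unbounded_words():
    for i in itertools.count():
        yield "w" * (i % 3 + 1)

stream = word_lengths(unbounded_words())
print(list(itertools.islice(stream, 5)))  # [1, 2, 3, 1, 2]
```

In Beam, the same separation holds at pipeline scale: the transforms stay the same, and only the source (e.g. a text file vs. a Pub/Sub topic) determines whether the job runs as batch or streaming.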
Some early enthusiasm from customers
Energyworx uses Dataflow to process extensive amounts of time-series data for the energy industry: "We are using the Python SDK for Dataflow to deliver our intelligence capabilities, allowing us and our customers to create and plug in algorithms with ease and run them at scale across billions of data sources simultaneously. We are excited to use the Beta release with streaming capabilities and look forward to the new features that will be added in the GA release. We are also really happy with the direct support we receive from the Google development team as an early adopter."
Shine.fr also uses Dataflow extensively: "The Python SDK for Dataflow provides a simple, intuitive interface for authoring streaming jobs in our preferred language, Python. We have been using the Python streaming SDK successfully, and eagerly look forward to the new capabilities that will be added in subsequent releases."
Getting started is easy
Install version 2.5 (or above) of the SDK by following the instructions in the quickstart. Then, you're ready to create your first pipeline.
In keeping with big data tradition, let's look at a word count example. Here we have a snippet of code that consumes a stream of text data from Cloud Pub/Sub, defines a fixed window of 15 minutes, counts the occurrences of each word within each 15-minute window, and then writes the results to Pub/Sub:
import re

import apache_beam as beam
from apache_beam.transforms import window

def split_fn(line):
    # Extract words (letters, digits, apostrophes) from each line of text.
    return re.findall(r"[A-Za-z0-9']+", line)

def count_ones(word_ones):
    (word, ones) = word_ones
    return '%s: %d' % (word, sum(ones))

p = beam.Pipeline(options=pipeline_options)
counts = (p
          | beam.io.ReadStringsFromPubSub(topic=known_args.input_topic)
          | 'split' >> beam.FlatMap(split_fn)
          | 'pair_with_one' >> beam.Map(lambda x: (x, 1))
          | beam.WindowInto(window.FixedWindows(15 * 60, 0))  # 15-minute windows
          | 'group' >> beam.GroupByKey()
          | 'count' >> beam.Map(count_ones)
          | beam.io.WriteStringsToPubSub(known_args.output_topic))
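To see what each stage produces, the same split, pair, group, and count logic can be traced in plain Python without Beam (the sample input line here is invented for illustration):

```python
import re
from collections import defaultdict

def split_fn(line):
    # Same tokenizer as the pipeline: runs of letters, digits, apostrophes.
    return re.findall(r"[A-Za-z0-9']+", line)

lines = ["the cat and the hat"]  # stand-in for one window of Pub/Sub messages

# FlatMap: one line -> many words
words = [w for line in lines for w in split_fn(line)]

# Map: pair each word with 1
pairs = [(w, 1) for w in words]

# GroupByKey: collect the ones per word
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Map(count_ones): sum the ones per word
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
```

In the streaming pipeline, the only difference is that GroupByKey waits for each 15-minute window to close before emitting that window's groups.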
For the complete example, see Apache Beam's streaming wordcount example.
This Beta release supports reading from Pub/Sub topics and writing to Pub/Sub topics or BigQuery tables. It supports fixed, sliding, and session windows, along with basic triggering semantics. For further details, please consult the Apache Beam Programming Guide.
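Fixed and sliding window assignment is simple timestamp arithmetic. A stdlib-only sketch of how an element's timestamp maps to windows (illustrative of the semantics, not Beam's internal code; `fixed_window` and `sliding_windows` are invented names, and all sizes are in seconds):

```python
def fixed_window(ts, size):
    """Return the (start, end) of the single fixed window containing ts."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Return every (start, end) sliding window containing ts."""
    start = ts - (ts % period)  # latest window starting at or before ts
    windows = []
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return list(reversed(windows))

# A 15-minute fixed window:
print(fixed_window(1000, 15 * 60))     # (900, 1800)
# 60s windows sliding every 30s: each element lands in two windows.
print(sliding_windows(1000, 60, 30))   # [(960, 1020), (990, 1050)]
```

Session windows differ: they are data-driven, extending as long as successive elements for a key arrive within a gap duration, so they cannot be computed from one timestamp alone.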
In subsequent releases, we plan to expand the set of supported I/O connectors and add support for complex triggers, stateful processing, and Dataflow's autoscaling feature.
We encourage you to take Python streaming on Dataflow for a spin and, as always, send us your feedback!