Debunking myths about Python on Dataflow
Mehran Nazir
Product Manager, Google Cloud
For many developers who come to Dataflow, Google Cloud's fully managed data processing service, the first decision they have to make is which programming language to use. Dataflow developers use the open-source Apache Beam SDK to author their pipelines, and can choose from several languages: Java, Python, Go, SQL, Scala, and Kotlin. In this post, we'll focus on one of our fastest-growing languages: Python.
The Python SDK for Apache Beam was introduced shortly after Dataflow's general availability was announced in 2015, and is the primary choice for several of our largest customers. However, it has suffered from a reputation for being incomplete and inferior to its predecessor, the Java SDK. While historically there was some truth to this perception, Python's feature set has caught up to the Java SDK and offers new capabilities that are specifically catered to Python developers. We'll use the rest of this post to examine some popular myths, and conclude with a brief review of the latest & greatest for the Python SDK.
Myth 1: Python doesn’t support streaming pipelines.
BUSTED. Streaming support for Python has been available for more than two years, released as part of Beam 2.16 in October 2019. This means the unique capabilities of streaming Dataflow, including Streaming Engine, update, drain, and snapshots, are all available to Python users.
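For a concrete picture, here is a minimal streaming sketch; the Pub/Sub topic name is a hypothetical placeholder:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# streaming=True tells the runner to execute this as a streaming pipeline.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # "projects/my-project/topics/my-topic" is a hypothetical topic name.
     | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
     | beam.Map(lambda msg: msg.decode('utf-8'))
     # Group the unbounded stream into one-minute fixed windows.
     | beam.WindowInto(beam.window.FixedWindows(60))
     | beam.combiners.Count.PerElement()
     | beam.Map(print))
```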
Myth 2: SqlTransform isn’t available for Python.
BUSTED. Tired of writing tedious code to join data streams together? Use SqlTransform for Python. Apache Beam introduced SqlTransform support in the Python SDK last year as part of our advancements with multi-language pipelines (more on that later). Take a look at this example to get started.
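Here is a minimal sketch of the pattern; the Order schema and values are invented for illustration, and running SqlTransform requires a Java runtime on the machine so Beam can start its expansion service:

```python
import typing

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

# SQL needs a schema-aware PCollection; this Order type is hypothetical.
class Order(typing.NamedTuple):
    user_id: str
    amount: float

beam.coders.registry.register_coder(Order, beam.coders.RowCoder)

with beam.Pipeline() as p:
    (p
     | beam.Create([
         Order(user_id='alice', amount=10.0),
         Order(user_id='bob', amount=5.0),
         Order(user_id='alice', amount=2.5),
     ]).with_output_types(Order)
     # The input PCollection is visible to the query as the table PCOLLECTION.
     | SqlTransform("SELECT user_id, SUM(amount) AS total "
                    "FROM PCOLLECTION GROUP BY user_id")
     | beam.Map(print))
```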
Myth 3: State and Timer APIs aren’t available in Python.
BUSTED. Two of the most powerful features of the Beam SDK are the State and Timer APIs, which allow for more fine-grained control over aggregations than windows and triggers do. These APIs are available in Python, and offer parity with the Java SDK for the most common use cases. Reference the Beam programming guide for some examples of what you can do with these APIs.
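As a rough sketch of the state half of the API (the keyed input shape here is assumed), a stateful DoFn can keep a running count per key:

```python
import apache_beam as beam
from apache_beam.transforms.userstate import CombiningValueStateSpec

class RunningCountFn(beam.DoFn):
    # One count cell is kept per key and window.
    COUNT_STATE = CombiningValueStateSpec('count', sum)

    def process(self, element, count=beam.DoFn.StateParam(COUNT_STATE)):
        key, _ = element
        count.add(1)
        # Emit the key along with how many elements we have seen so far.
        yield key, count.read()

with beam.Pipeline() as p:
    (p
     | beam.Create([('a', 1), ('b', 1), ('a', 1)])
     | beam.ParDo(RunningCountFn())
     | beam.Map(print))
```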
Myth 4: Python supports only a limited set of I/O connectors.
BUSTED. The most glaring gap between the Java and Python SDKs has been the set of I/O connectors, which facilitate read & write operations for Dataflow pipelines. Our support for multi-language pipelines puts this myth to rest. With cross-language transforms, a Python pipeline can invoke a Java transform under the hood, giving it access to the entire library of Java-based I/O connectors. In fact, that's how we implemented the KafkaIO module for Python (see example). Developers can write their own cross-language transforms using the instructions in the patterns library.
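For instance, a Python pipeline can read from Kafka through the cross-language connector. In this sketch the broker address and topic are placeholders, and a Java runtime is needed so Beam can start the expansion service:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # ReadFromKafka expands to the Java KafkaIO transform under the hood.
     | ReadFromKafka(
         consumer_config={'bootstrap.servers': 'localhost:9092'},  # placeholder broker
         topics=['my_topic'])                                      # placeholder topic
     # Each element is a (key, value) pair of bytes.
     | beam.Map(lambda kv: kv[1].decode('utf-8'))
     | beam.Map(print))
```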
Myth 5: There are fewer code samples in Python.
PLAUSIBLE. Apache Beam maintains several repos for Python examples: one for snippets, one for notebook samples, and one for complete examples. However, there are notable places where Python is still missing, most prominently our Dataflow Templates repository. This is attributable to the fact that most of Dataflow's initial users were Java developers. But this quick observation ignores two key factors: 1) the unique assets that are only available to Python developers, and 2) the tremendous momentum behind the Beam Python community.
Python developers love writing exploratory code in JupyterLab notebook environments. Beam offers an interactive module that lets you iteratively build and run your Beam Python pipelines in a Jupyter notebook. We make deploying these notebooks simple with Beam Notebooks, which spins up a managed notebook instance containing all the Beam libraries required to prototype your pipelines. We also have a number of helpful examples & tutorials that show how you can sample data from a streaming source, or attach GPUs to your notebooks to accelerate your processing. The notebooks also include a learning track for new Beam developers that covers basic operations, aggregations, and streaming concepts. You can review the documentation here.
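In a notebook cell, that workflow looks roughly like this minimal sketch:

```python
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# The InteractiveRunner caches intermediate PCollections so you can inspect them.
p = beam.Pipeline(InteractiveRunner())

words = p | beam.Create(['to', 'be', 'or', 'not', 'to', 'be'])
counts = words | beam.combiners.Count.PerElement()

# Materialize and render the PCollection inline in the notebook.
ib.show(counts)
```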
Over the past few years, we have seen a number of extensions built on top of the Beam Python SDK. Cruise Automation published the Terra library, which enables 70+ Cruise data scientists to submit jobs without having to understand the underlying infrastructure. Spotify open-sourced Klio, a framework built on top of Beam Python that simplifies common tasks required for processing audio and media files. I have even pointed customers to beam-nuggets, a lightweight collection of Beam Python transformations used for reading/writing from/to relational databases. Open-source developers and large organizations are doubling down on Beam Python, and these brief examples underscore that trend.
What's new:
The Dataflow team has launched a slew of new capabilities that will help Python developers advance their use cases. Here's a quick run-down of the newest features:
Custom containers: Users can now specify their own container image when they launch their Dataflow job. This is a common ask from our Python audience, who like to package their pipeline code with their own libraries and dependencies. We're excited to announce that this feature is generally available. Take a look at the documentation so you can try it for yourself!
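As a sketch, pointing a job at a custom image is a pipeline-option change; every name below is a hypothetical placeholder, and note that older SDK releases used the worker_harness_container_image option instead:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Project, region, bucket, and image name are all hypothetical placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    # Point the workers at a custom image built on top of the Beam SDK image.
    sdk_container_image='gcr.io/my-project/my-beam-image:latest',
)
```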
GPUs: We recently announced the general availability of GPU support on Dataflow. You can now accelerate your data processing by provisioning GPUs for your Dataflow jobs, another common request from machine learning practitioners. You can review the details of the launch here.
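Attaching GPUs is likewise a pipeline-option change; this sketch assumes an NVIDIA T4 and hypothetical project settings:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Project, bucket, and accelerator choice are hypothetical placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    # Request one T4 per worker and have Dataflow install the NVIDIA driver.
    dataflow_service_options=[
        'worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver'
    ],
)
```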
Beam DataFrames: Beam DataFrames bring the magic of pandas to the Beam SDK, allowing developers to convert PCollections to DataFrames and use the standard methods available in the pandas DataFrame API. DataFrames give developers a more natural way to interact with their datasets and create their pipelines, and will be a stepping stone to future efficiency improvements. Beam DataFrames are generally available starting with Beam 2.32, which was released in August. Learn more about Beam DataFrames here.
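A minimal sketch of the round trip between PCollections and DataFrames; the Purchase schema and column names are invented for illustration:

```python
import typing

import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

# A schema-aware PCollection is required; this Purchase type is hypothetical.
class Purchase(typing.NamedTuple):
    user_id: str
    amount: float

beam.coders.registry.register_coder(Purchase, beam.coders.RowCoder)

with beam.Pipeline() as p:
    purchases = (p
                 | beam.Create([
                     Purchase('alice', 10.0),
                     Purchase('bob', 5.0),
                     Purchase('alice', 2.5),
                 ]).with_output_types(Purchase))

    # Convert the PCollection to a deferred pandas-style DataFrame.
    df = to_dataframe(purchases)
    totals = df.groupby('user_id').amount.sum()

    # Convert back to a PCollection to continue with regular Beam transforms;
    # include_indexes keeps the user_id grouping key in the output rows.
    (to_pcollection(totals, include_indexes=True) | beam.Map(print))
```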
We invite you to try out our new features using Beam Notebooks today!
Do you have an interesting idea that you want to share with our Beam community? You can reach out to us through various modes, all found here. We are excited to see what’s next for Beam Python.