Google Cloud

Announcing general availability of Google Cloud Dataflow for Python

As announced at Google Cloud NEXT '17, Google Cloud Dataflow is generally available for Python. Pipelines built using Apache Beam for Python can now use the advanced features provided by the Cloud Dataflow service. Apache Beam, recently established by the Apache Software Foundation as a Top-Level project, provides a unified programming model that allows you to implement data processing jobs that can run on any execution engine.

Cloud Dataflow is a fully-managed, efficient service that offers many features for executing Beam pipelines. The service is integrated with the rest of Google Cloud Platform (GCP), providing connectors for services like Google Cloud Storage and Google BigQuery. It provides autotuning features such as Autoscaling and Dynamic Work Rebalancing that dynamically optimize your Dataflow job while it’s running and simplify pipeline configurations. The Dataflow Monitoring Interface is a streamlined UI for monitoring, logging and understanding the timing of your pipeline.

The Cloud Dataflow service, along with the Dataflow SDK for Python, gives you everything you need to get started. You can start using Dataflow today for your custom data processing needs and take advantage of the popular pre-installed packages, including TensorFlow, NumPy, SciPy and pandas. For example, you can quickly experiment directly from an interactive Python shell, create a vegetation index from Landsat data or classify images with TensorFlow. You can even install non-Python dependencies to execute embarrassingly parallel R programs.

We’ve been overwhelmed by the volume and variety of interest. Veolia, a French transnational environmental services company, has been using Google Cloud Dataflow for Python in production for months now.

“On the Digital Factory team at Veolia, Python is our preferred language. Our Data Science teams use it in Datalab, our web developers use it with App Engine and now our data engineering team use it with Dataflow to power our data lake. We have relied on Dataflow in production to transform, clean and insert tens of thousands of files into our BigQuery data lake every day for the last 8 months. Our favorite part is that, like BigQuery and App Engine, it is serverless and fully-managed, especially with the new autoscaling and monitoring features, so we only worry about writing our Python code.”

Alexandre Vivien
Deputy CTO of Veolia

As a start, you can easily run a pipeline to find the longest sessions in raw log data, using Beam with just a few lines of Python code. You’ll need to install google-cloud-dataflow and then you're ready to create your first pipeline:

  import apache_beam as beam

  # ExtractUserAndTimestampDoFn and ComputeTopSessions are defined in the full example.
  p = beam.Pipeline()
  (p | beam.io.ReadFromText('gs://your_log_data/*.json')
     # Use a DoFn for loading each json entry, and extracting relevant parts of data.
     | beam.ParDo(ExtractUserAndTimestampDoFn())
     # Use a custom PTransform for finding top 10 longest sessions.
     | ComputeTopSessions()
     | beam.io.WriteToText('./top_sessions.txt'))  # Write results to a file.
  p.run()  # Start the pipeline.

And voilà! Your first pipeline is running. From here, you can read the full example and use the Cloud Dataflow service to run it much faster on a pool of workers.
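
To get a feel for what the two custom steps compute before opening the full example, here is a minimal plain-Python sketch of the same idea, with no Beam involved. It is not the code from the full example: the function names `extract_user_and_timestamp` and `top_sessions`, the JSON field names `user` and `timestamp`, and the 30-minute inactivity gap that ends a session are all assumptions made for illustration.

```python
import json
from collections import defaultdict


def extract_user_and_timestamp(line):
    """Parse one JSON log line into a (user, timestamp) pair.

    The field names 'user' and 'timestamp' are assumptions for this sketch.
    """
    entry = json.loads(line)
    return entry['user'], float(entry['timestamp'])


def top_sessions(events, gap_seconds=1800, n=10):
    """Return the n longest sessions as (duration, user, start) tuples.

    A session is a run of one user's events in which consecutive events
    are at most gap_seconds apart (an assumed definition).
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > gap_seconds:
                # The gap is too long: close the current session.
                sessions.append((prev - start, user, start))
                start = ts
            prev = ts
        sessions.append((prev - start, user, start))

    # Longest sessions first.
    return sorted(sessions, reverse=True)[:n]


# Made-up sample data standing in for the raw log files.
lines = [
    '{"user": "ann", "timestamp": 0}',
    '{"user": "ann", "timestamp": 600}',
    '{"user": "bob", "timestamp": 100}',
    '{"user": "ann", "timestamp": 5000}',
    '{"user": "bob", "timestamp": 1500}',
]
result = top_sessions(extract_user_and_timestamp(line) for line in lines)
```

This version holds every event in memory on one machine, which is exactly the limit a Beam pipeline removes: the same parse-then-aggregate shape is expressed as transforms that the service can spread across workers.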

Next steps

We've been thrilled to help this SDK mature over the past year and a half since it was first introduced, and more recently drive its inclusion into Apache Beam. And now, we're excited to offer general availability support for running it on Google Cloud. Thanks to our many customers who tried our beta with a variety of workloads and provided us with invaluable feedback!

Follow the links below to learn more:

  • See the quickstart to start running your Python pipelines on Cloud Dataflow.
  • Join the community and contribute to the open source Apache Beam project.
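
As a rough idea of what the quickstart involves, moving a pipeline from a local run to the Cloud Dataflow service is mostly a matter of passing it a few options. The sketch below only builds that list of flags. The helper `dataflow_flags` is hypothetical, and the project ID, bucket and job name are placeholders; check the quickstart for the options your SDK version expects.

```python
def dataflow_flags(project, bucket, job_name, max_workers=10):
    """Build argv-style flags for running a Beam pipeline on Cloud Dataflow.

    This helper is hypothetical; all argument values are placeholders.
    """
    return [
        '--runner=DataflowRunner',
        '--project={}'.format(project),
        '--job_name={}'.format(job_name),
        '--staging_location=gs://{}/staging'.format(bucket),
        '--temp_location=gs://{}/temp'.format(bucket),
        # Upper bound on the worker pool that autoscaling may grow to.
        '--max_num_workers={}'.format(max_workers),
    ]


flags = dataflow_flags('my-project-id', 'my-bucket', 'top-sessions')

# With the SDK installed, the flags would be handed to the pipeline, e.g.:
#   import apache_beam as beam
#   p = beam.Pipeline(argv=flags)
```

Keeping the flags in one place makes it easy to switch between a local test run (no flags) and a run on the service without touching the pipeline code itself.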