Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Google announces Cloud Dataflow with Python support

Tuesday, March 22, 2016

Today, we're happy to announce Alpha support for executing batch processing jobs with the Cloud Dataflow SDK for Python. This SDK is a pure Python implementation of the Apache Beam computation model (formerly known as Dataflow model), which we recently contributed to the Apache Software Foundation as an incubating project.

This release allows new categories of users, particularly Python developers and members of the scientific community, to benefit from the Apache Beam model for large-scale distributed data processing. For example, users of the popular NumPy, SciPy, and pandas packages can now use Apache Beam to implement their calculations at larger scales.
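For instance, an aggregation that would previously run in a single NumPy process can be written as an ordinary Python function and handed to a Beam combining transform. The sketch below is illustrative and not from this post: it shows a NumPy-based reducer of the kind one could pass to a CombinePerKey-style step, exercised here locally on a plain Python list.

```python
import numpy as np

def robust_mean(values):
    """Illustrative reducer: trimmed mean of a group of prices.

    Written as a plain function so a Beam transform could apply it
    per key, but runnable locally as-is.
    """
    arr = np.sort(np.asarray(list(values), dtype=float))
    k = len(arr) // 10  # trim 10% of samples from each tail
    trimmed = arr[k:len(arr) - k] if k else arr
    return float(trimmed.mean())

# Local sanity check on a small, made-up group of prices with one outlier.
prices = [9, 10, 10, 10, 10, 10, 10, 10, 10, 1000]
print(robust_mean(prices))  # 10.0
```

Because the reducer is just a function, the same code can be unit-tested locally and then reused unchanged inside a distributed pipeline.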

The following code sample shows how the Apache Beam Python model can implement a complete processing pipeline, capable of processing a huge collection of records, in just a few lines of code:
import json
# ... other imports ...
import google.cloud.dataflow as df

@df.typehints.with_output_types(df.typehints.Tuple[int, float])
def parse_sales_record(line):
  # Lines look like this: 
  # {"Timestamp": 1234.56, "Price": 10, "ProductName": "Name", "ProductID": 4}
  record = json.loads(line)
  return int(record['ProductID']), float(record['Price'])

p = df.Pipeline(...options...)
(p
    | df.io.Read(df.io.TextFileSource('gs://SOMEBUCKET/PATH/*.json'))
    | df.Map(parse_sales_record)
    | df.CombinePerKey(sum)
    | df.Map(lambda (product, value): {'ProductID': product, 'Value': value})
    | df.io.Write(df.io.BigQuerySink('SOMEDATASET.SOMETABLE',
        schema='ProductID:INTEGER, Value:FLOAT',
        create_disposition=df.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=df.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run()

This pipeline processes sales records stored in newline-delimited JSON files, and writes the aggregated sales results to a Google BigQuery table. The same pipeline can work with data sets of virtually any size, in small or huge files, taking advantage of the full power of the Google Cloud Dataflow managed service. Dataflow's automatic workflow optimization and orchestration, along with automatic parallelization, scheduling, and failure retry, are now available to Python developers.
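The record-level logic of the pipeline can be checked without the service. The sketch below reimplements the parse, per-key sum, and row-formatting steps in plain Python; the runner, source, and sink are omitted, and the input lines are made up for illustration:

```python
import json
from collections import defaultdict

def parse_sales_record(line):
    # Same shape as the pipeline's parse step: (ProductID, Price).
    record = json.loads(line)
    return int(record['ProductID']), float(record['Price'])

def combine_per_key_sum(pairs):
    # Local stand-in for df.CombinePerKey(sum).
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals

lines = [
    '{"Timestamp": 1234.56, "Price": 10, "ProductName": "A", "ProductID": 4}',
    '{"Timestamp": 1234.99, "Price": 5,  "ProductName": "A", "ProductID": 4}',
    '{"Timestamp": 1235.10, "Price": 7,  "ProductName": "B", "ProductID": 9}',
]

totals = combine_per_key_sum(parse_sales_record(l) for l in lines)
rows = [{'ProductID': pid, 'Value': value}
        for pid, value in sorted(totals.items())]
print(rows)  # [{'ProductID': 4, 'Value': 15.0}, {'ProductID': 9, 'Value': 7.0}]
```

Testing transform logic locally like this, then swapping in the managed runner for scale, is the workflow the model is designed for.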

We look forward to providing Python developers with the same distributed data processing capabilities that have been available to Java developers since Cloud Dataflow became generally available.

Posted by Silviu Calinoiu, Software Engineer

