Release Notes: Dataflow SDK for Python

This page documents production updates to the Dataflow SDK for Python. You can periodically check this page for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.

The Dataflow SDK for Python supports batch execution only; streaming execution is not yet available. Batch pipelines can be executed locally (typically for development and testing) or on Google Cloud using the Cloud Dataflow service.

The support page contains information about the support status of each release of the Dataflow SDK.

To install and use the Dataflow SDK, see the Dataflow SDK installation guide.

Cloud Dataflow SDK distribution contents

The Cloud Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Cloud Dataflow service, such as:

  • The core SDK
  • DirectRunner and DataflowRunner
  • I/O components for other Google Cloud Platform services

The Cloud Dataflow SDK distribution does not include other Beam components, such as:

  • Runners for other distributed processing engines
  • I/O components for non-Cloud Platform services

Release notes

This section provides each version's most relevant changes for Cloud Dataflow customers.

2.2.0 (December 8, 2017)

Version 2.2.0 is based on a subset of Apache Beam 2.2.0. See the Apache Beam 2.2.0 release notes for additional change information.

Added two new Read PTransforms (textio.ReadAllFromText and avroio.ReadAllFromAvro) that can be used to read a very large number of files.
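
For example, a minimal sketch (the bucket and file pattern are hypothetical) that applies ReadAllFromText to a PCollection of file patterns:

    import apache_beam as beam
    from apache_beam.io import textio

    # ReadAllFromText expands a PCollection of file patterns into the lines of
    # the matched files, which scales to a very large number of files.
    with beam.Pipeline() as p:
        lines = (p
                 | beam.Create(['gs://my-bucket/logs/*.txt'])
                 | textio.ReadAllFromText())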

Added adaptive throttling support to DatastoreIO.

DirectRunner improvements: DirectRunner will optionally retry failed bundles, and streaming pipelines can now be cancelled with Ctrl + C.

DataflowRunner improvements: Added support for cancel, wait_until_finish(duration), and job labels.
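
As a rough sketch of the new job controls (pipeline options such as project and temp_location are omitted, and the duration value is illustrative):

    import apache_beam as beam

    pipeline = beam.Pipeline()  # assume options that target DataflowRunner
    _ = pipeline | beam.Create([1, 2, 3])

    result = pipeline.run()
    # wait_until_finish accepts an optional duration in milliseconds.
    result.wait_until_finish(duration=10 * 60 * 1000)  # wait up to 10 minutes
    if result.state not in ('DONE', 'FAILED', 'CANCELLED'):
        result.cancel()  # cancel the job if it is still running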

Improved pydoc text and formatting.

Improved several stability, performance, and documentation issues.

2.1.1 (September 22, 2017)

Version 2.1.1 is based on a subset of Apache Beam 2.1.1.

Fixed a compatibility issue with the Python six package.

2.1.0 (September 1, 2017)

Version 2.1.0 is based on a subset of Apache Beam 2.1.0. See the Apache Beam 2.1.0 release notes for additional change information.

Identified issue: This release has a compatibility issue with the Python six 1.11.0 package. Work around this issue by running pip install google-cloud-dataflow==2.1.0 six==1.10.0.

Added limited streaming support to DirectRunner, with a Pub/Sub source and sink and a BigQuery sink.

Added support for new pipeline options --subnetwork and --dry_run. Added experimental support for new pipeline option --beam_plugins.
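
A hedged sketch of passing the new options as flags (the project, bucket, and subnetwork names are hypothetical):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(flags=[
        '--runner=DataflowRunner',
        '--project=my-project',
        '--temp_location=gs://my-bucket/tmp',
        '--subnetwork=regions/us-central1/subnetworks/my-subnet',
    ])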

Fixed an issue in which bzip2 files were being partially read; added support for concatenated bzip2 files.

Improved several stability, performance, and documentation issues.

2.0.0 (May 31, 2017)

Version 2.0.0 is based on a subset of Apache Beam 2.0.0. See the Apache Beam 2.0.0 release notes for additional change information.

This release includes breaking changes.

Identified issue: This release has a compatibility issue with the Python six 1.11.0 package. Work around this issue by running pip install google-cloud-dataflow==2.0.0 six==1.10.0.

Added support for using the Stackdriver Error Reporting Interface.

Added support for Dataflow templates to file-based sources and sinks (e.g. TextIO, TFRecordIO, AvroIO).
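
For example, a minimal template sketch for a TextIO read (the option name and default pattern are hypothetical); the input is declared as a runtime value provider so it can be supplied when the template is executed:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class TemplateOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            parser.add_value_provider_argument(
                '--input',
                type=str,
                default='gs://my-bucket/input/*.txt',
                help='File pattern supplied at template execution time.')

    options = PipelineOptions()
    template_options = options.view_as(TemplateOptions)
    with beam.Pipeline(options=options) as p:
        lines = p | 'Read' >> beam.io.ReadFromText(template_options.input)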

Added support for the region flag.

Added an additional Google API requirement. You must now also enable the Cloud Resource Manager API.

Moved pipeline options into options modules.

finish_bundle now only allows emitting windowed values.
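
A minimal sketch of a DoFn under the new rules (the counting logic is illustrative); values emitted from finish_bundle must be wrapped in a WindowedValue:

    import apache_beam as beam
    from apache_beam.transforms.window import GlobalWindow
    from apache_beam.utils.timestamp import MIN_TIMESTAMP
    from apache_beam.utils.windowed_value import WindowedValue

    class CountBundleElements(beam.DoFn):
        def start_bundle(self):
            self._count = 0  # start_bundle may no longer emit values

        def process(self, element):
            self._count += 1
            yield element

        def finish_bundle(self):
            # finish_bundle may only emit explicitly windowed values.
            yield WindowedValue(self._count, MIN_TIMESTAMP, [GlobalWindow()])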

Moved testing-related code to apache_beam.testing, and moved assert_that, equal_to, and is_empty to apache_beam.testing.util.
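
For example, a minimal test sketch using the relocated utilities:

    import apache_beam as beam
    from apache_beam.testing.test_pipeline import TestPipeline
    from apache_beam.testing.util import assert_that, equal_to

    with TestPipeline() as p:
        doubled = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)
        assert_that(doubled, equal_to([2, 4, 6]))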

Changed the following names:

  • Use apache_beam.io.filebasedsink instead of apache_beam.io.file.
  • Use apache_beam.io.filesystem instead of apache_beam.io.fileio.
  • Use TaggedOutput instead of SideOutputValue (see the sketch after this list).
  • Use AfterAny instead of AfterFirst.
  • Use apache_beam.options instead of apache_beam.util for pipeline_options and related imports.
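
A minimal sketch of the TaggedOutput rename (the even/odd split is illustrative):

    import apache_beam as beam
    from apache_beam import pvalue

    class SplitEvenOdd(beam.DoFn):
        def process(self, element):
            if element % 2 == 0:
                yield element                              # main output
            else:
                yield pvalue.TaggedOutput('odd', element)  # tagged output

    with beam.Pipeline() as p:
        results = (p
                   | beam.Create([1, 2, 3, 4])
                   | beam.ParDo(SplitEvenOdd()).with_outputs('odd', main='even'))
        evens, odds = results.even, results.odd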

Removed the ability to emit values from start_bundle.

Removed support for all credentials other than application default credentials. Removed --service_account_name and --service_account_key_file flags.

Removed IOChannelFactory and replaced it with BeamFileSystem.

Removed deprecated context parameter from DoFn.

Removed SingletonPCollectionView, IterablePCollectionView, ListPCollectionView, and DictPCollectionView. Use AsSingleton, AsIter, AsList, and AsDict instead.
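
For example, a minimal sketch of passing a side input with AsSingleton (the values are illustrative):

    import apache_beam as beam

    with beam.Pipeline() as p:
        main = p | 'Main' >> beam.Create([1, 2, 3])
        factor = p | 'Factor' >> beam.Create([10])
        scaled = main | beam.Map(
            lambda x, f: x * f,
            f=beam.pvalue.AsSingleton(factor))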

Improved several stability, performance, and documentation issues.


Note: All versions prior to 2.0.0 are DEPRECATED.

0.6.0 (March 23, 2017)

This release includes breaking changes.

Identified issue: Dataflow pipelines that read GroupByKey results more than once, either by having multiple ParDos consuming the same GroupByKey results or by reiterating over the same data in the pipeline code, may lose some data. To avoid being affected by this issue, upgrade to the Dataflow SDK for Python version 2.0.0 or later.

Added Metrics API support to the DataflowRunner.
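
A minimal sketch of a user metric (the counter name is hypothetical):

    import apache_beam as beam
    from apache_beam.metrics import Metrics

    class CountEmptyLines(beam.DoFn):
        def __init__(self):
            # A user counter that the runner reports for the job.
            self.empty_lines = Metrics.counter(self.__class__, 'empty_lines')

        def process(self, element):
            if not element.strip():
                self.empty_lines.inc()
            yield element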

Added support for reading and writing headers for text files.
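
For example, a hedged sketch (the paths and header are hypothetical) that skips a header line when reading and writes one when writing:

    import apache_beam as beam

    with beam.Pipeline() as p:
        rows = p | beam.io.ReadFromText('gs://my-bucket/in.csv',
                                        skip_header_lines=1)
        rows | beam.io.WriteToText('gs://my-bucket/out',
                                   header='id,name,score')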

Moved Google Cloud Platform-specific I/O modules to the apache_beam.io.gcp namespace.

Removed label as an optional first argument for all PTransforms. Use label >> PTransform(...) instead.
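
For example, a minimal word-count sketch using the >> labeling syntax:

    import apache_beam as beam

    with beam.Pipeline() as p:
        counts = (p
                  | 'Create' >> beam.Create(['to be or not to be'])
                  | 'Split' >> beam.FlatMap(lambda line: line.split())
                  | 'Pair' >> beam.Map(lambda word: (word, 1))
                  | 'Count' >> beam.CombinePerKey(sum))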

Removed BlockingDataflowPipelineRunner.

Removed DataflowPipelineRunner and DirectPipelineRunner. Use DataflowRunner and DirectRunner instead.

Improved several stability, performance, and documentation issues.

0.5.5 (February 8, 2017)

This release includes breaking changes.

Added a new metric, Total Execution Time, to the Dataflow monitoring interface.

Removed the Aggregators API. Note that the Metrics API offers similar functionality.

Removed support for non-annotation-based DoFns.

Improved several stability and performance issues.

0.5.1 (January 27, 2017)

This release includes breaking changes.

Added Metrics API support to the DirectRunner.

Added source and sink implementations for TFRecordIO.

Added support for annotation-based DoFns.

Autoscaling is now enabled by default for jobs run using the Dataflow service, unless the autoscaling_algorithm option is explicitly set to NONE.

Renamed PTransform.apply to PTransform.expand.
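
A minimal sketch of a composite transform under the new name (the per-key count is illustrative):

    import apache_beam as beam

    class CountPerKey(beam.PTransform):
        # Composite transforms now override expand (formerly apply).
        def expand(self, pcoll):
            return (pcoll
                    | beam.Map(lambda element: (element, 1))
                    | beam.CombinePerKey(sum))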

Renamed apache_beam.utils.options to apache_beam.utils.pipeline_options.

Several changes to pipeline options:

  • The job_name option is now optional and defaults to beamapp-username-date(mmddhhmmss)-microseconds.
  • The temp_location option is now required.
  • The staging_location option is now optional and defaults to the value of the temp_location option.
  • The teardown_policy, disk_source_image, no_save_main_session, and pipeline_type_check options have been removed.
  • The machine_type and disk_type option aliases have been removed.

Renamed DataflowPipelineRunner to DataflowRunner.

Renamed DirectPipelineRunner to DirectRunner.

DirectPipelineRunner is no longer blocking. To block until pipeline completion, call the wait_until_finish() method of the PipelineResult object returned by the runner's run() method.

BlockingDataflowPipelineRunner is now deprecated and will be removed in a future release.

0.4.4 (December 13, 2016)

Added support for optionally using BigQuery standard SQL.
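
A hedged sketch of opting in to standard SQL (the project, dataset, and query are hypothetical):

    import apache_beam as beam

    with beam.Pipeline() as p:
        rows = p | beam.io.Read(beam.io.BigQuerySource(
            query='SELECT word, COUNT(*) AS c '
                  'FROM `my-project.my_dataset.my_table` GROUP BY word',
            use_standard_sql=True))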

Added support for display data.

Updated DirectPipelineRunner to support bundle-based execution.

Renamed the --profile flag to --profile_cpu.

Windowed side inputs are now correctly supported.

Improved service account-based authentication. (Note: Using the --service_account_key_file command line option requires installation of pyOpenSSL.)

Improved several stability and performance issues.

0.4.3 (October 17, 2016)

Fixed package requirements.

0.4.2 (September 28, 2016)

Identified issue: Installations of this version on or after October 14, 2016 pick up a newer version of oauth2client, which contains breaking changes. Note that the Cloud ML SDK is not affected.
Workaround: Run pip install google-cloud-dataflow oauth2client==3.0.0.

Improved performance and stability of I/O operations.

Fixed several minor bugs.

0.4.1 (September 1, 2016)

Allowed TopCombineFn to take a key argument instead of a comparator.

Improved performance and stability issues in gcsio.

Improved various performance issues.

0.4.0 (July 27, 2016)

This is the first beta release and includes breaking changes.

Renamed the google.cloud.dataflow package to apache_beam.

Updated the main session to no longer be saved by default. Use the save_main_session pipeline option to save the main session.

Announced a test framework for custom sources.

Announced filebasedsource, a new module that provides a framework for creating sources for new file types.

Announced AvroSource, a new SDK-provided source that reads Avro files.

Announced support for zlib and DEFLATE compression.

Announced support for the >> operator for labeling PTransforms.

Announced size-estimation support for Python SDK Coders.

Improved various performance issues.

0.2.7


The 0.2.7 release includes the following changes:

  • Introduces OperationCounters.should_sample for sampling-based size estimation.
  • Implements fixed sharding in TextFileSink.
  • Uses multiple file rename threads in finalize_write method.
  • Retries idempotent I/O operations on Cloud Storage timeout.

0.2.6


The 0.2.6 release includes the following changes:

  • Pipeline objects can now be used in Python with statements.
  • Fixed several bugs, including module dictionary pickling and buffer overruns in the fast OutputStream.

0.2.5


The 0.2.5 release includes the following changes:

  • Added support for creating custom sources and reading from them in pipelines executed using DirectRunner and DataflowRunner.
  • Added DiskCachedPipelineRunner as a disk-backed alternative to DirectRunner.
  • Changed how undeclared side outputs of DoFns are handled in the cloud executor; they are now ignored.
  • Fixed a pickling issue that occurred when the Seaborn package is loaded.
  • The text file output sink can now write gzip-compressed files.

0.2.4


The 0.2.4 release includes the following changes:

  • Added support for large iterable side inputs.
  • Enabled support for all counter types.
  • Modified --requirements_file behavior to cache packages locally.
  • Added support for non-native TextFileSink.

0.2.3


The 0.2.3 release includes the following fixes:

  • The google-apitools version is no longer required to be pinned.
  • The oauth2client version is no longer required to be pinned.
  • Fixed import errors for the statement import google that were raised when the latest gcloud package and the Dataflow SDK were installed together.
  • Fixed the code to raise the correct exception for failures in start and finish of DoFn methods.

0.2.2


The 0.2.2 release includes the following changes:

  • Reduced the memory footprint of DirectPipelineRunner.
  • Fixed multiple bugs:
    • Fixed BigQuerySink schema record field type handling.
    • Added clearer error messages for missing files.
  • Created a new example using more complex BigQuery schemas.
  • Improved several performance issues by:
    • Reducing debug logging
    • Compiling some files with Cython

0.2.1


The 0.2.1 release includes the following changes:

  • Optimized performance in the following areas:
    • Logging
    • Shuffle writing
    • Coder usage
    • Compilation of some worker modules with Cython
  • Changed the default behavior for Cloud execution: Instead of downloading the SDK from a Cloud Storage bucket, you now download the SDK as a tarball from GitHub. When you run jobs using the Dataflow service, the SDK version used will match the version you've downloaded (to your local environment). You can use the --sdk_location pipeline option to override this behavior and provide an explicit tarball location (Cloud Storage path or URL).
  • Fixed several pickling issues related to how Dataflow serializes user functions and data.
  • Fixed several worker lease expiration issues experienced when processing large datasets.
  • Improved validation to detect various common errors, such as access issues and invalid parameter combinations, much earlier.
