Release Notes: Cloud Dataflow service

This page documents production updates to the Cloud Dataflow service. You can periodically check this page for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.

To get the latest product updates delivered to you, add the URL of this page to your feed reader, or add the feed URL directly: https://cloud.google.com/feeds/cloud-dataflow-release-notes.xml

October 08, 2019

Python streaming for Apache Beam SDK 2.16 or higher is generally available. You can now run streaming pipelines in Python on the Cloud Dataflow service.

Python 3 support for Apache Beam SDK 2.16.0 or higher is now generally available. This feature provides support for using Python 3.5, 3.6, and 3.7. You can run any existing Python 2.7 batch and streaming pipelines that use DirectRunner or DataflowRunner. However, you might need to make changes to ensure that your pipeline code is compatible with Python 3. Keyword-only arguments (a syntactic construct introduced in Python 3) are not yet supported by the Apache Beam SDK. For the current status and a summary of recent Python 3-specific improvements, follow updates on the Apache Beam issue tracker.
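
As an illustrative sketch only (not part of this release note), a small Python 3 streaming pipeline submitted to the Dataflow service might look like the following; the project, bucket, and topic names are placeholders:

```python
# Illustrative sketch only: a small Python 3 streaming pipeline submitted to
# the Dataflow service. Project, bucket, and topic names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/temp",   # placeholder bucket
    streaming=True,
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/in")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}".encode("utf-8"))
        | "Write" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/out")
    )
```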

October 07, 2019

Cloud Dataflow Shuffle and Streaming Engine are now available in two additional regions:

  • us-west1 (Oregon)
  • asia-east1 (Taiwan)

September 03, 2019

Automatic hot key detection is now enabled in batch pipelines for Apache Beam SDK 2.15.0 or higher.

August 09, 2019

Integration with Cloud Dataflow VPC Service Controls is generally available.

August 02, 2019

Using Cloud Dataflow with Cloud Key Management Service is now available in beta. Customer-managed encryption keys (CMEK) allow for encryption of your pipeline state. This feature is limited to Persistent Disks attached to Cloud Dataflow workers and used for Persistent Disk-based shuffle and streaming state storage.
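
As a hedged sketch (the exact option name follows the Apache Beam Python SDK and is an assumption here; all resource names are placeholders), a CMEK-protected job might be configured like this:

```python
# Sketch: supplying a customer-managed encryption key (CMEK) when launching a
# Dataflow job from the Apache Beam Python SDK. The dataflow_kms_key option
# name is an assumption, and all resource names below are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    dataflow_kms_key=(
        "projects/my-project/locations/us-central1/"
        "keyRings/my-keyring/cryptoKeys/my-key"
    ),
)
```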

August 01, 2019

Python 3 support for Apache Beam SDK 2.14.0 or higher is now in beta. This feature provides support for using Python 3.5, 3.6, and 3.7. You can run any existing Python 2.7 batch and streaming pipelines that use DirectRunner or DataflowRunner. However, you might need to make changes to ensure that your pipeline code is compatible with Python 3. Some syntactic constructs introduced in Python 3 are not yet fully supported by the Apache Beam SDK. For details and current status, follow updates on the Apache Beam issue tracker.

May 16, 2019

Cloud Dataflow SQL is now publicly available in alpha. Cloud Dataflow SQL lets you use SQL queries to develop and run Cloud Dataflow jobs from the BigQuery web UI.

April 18, 2019

Cloud Dataflow is now able to use workers in zones in the asia-northeast2 region (Osaka, Japan).

April 10, 2019

Cloud Dataflow Streaming Engine is generally available. The service is available in two additional regions:

  • asia-northeast1 (Tokyo)
  • europe-west4 (Netherlands)

Note that Streaming Engine requires the Apache Beam SDK for Java, version 2.10.0 or higher.

Cloud Dataflow Shuffle is now available in two additional regions:

  • asia-northeast1 (Tokyo)
  • europe-west4 (Netherlands)

Cloud Dataflow provides beta support for Flexible Resource Scheduling (FlexRS) in the us-central1 and europe-west1 regions.
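
A minimal sketch of opting a batch job into FlexRS from the Apache Beam Python SDK; the flexrs_goal option name is assumed, and the project and bucket are placeholders:

```python
# Sketch: opting a batch job into Flexible Resource Scheduling (FlexRS).
# The --flexrs_goal flag name is assumed; project and bucket are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",          # one of the FlexRS regions from this note
    "--temp_location=gs://my-bucket/temp",
    "--flexrs_goal=COST_OPTIMIZED",  # delay-tolerant, cost-optimized scheduling
])

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)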

Streaming autoscaling is generally available for pipelines that use Streaming Engine.
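
Because Streaming Engine at this point requires the Java SDK (see the note above), the following sketch is purely illustrative of the pipeline options involved, written with assumed Python SDK option names and placeholder resources:

```python
# Illustrative only: pipeline options for Streaming Engine with streaming
# autoscaling, written with assumed Python SDK option names. The note above
# requires the Java SDK, which spells these options differently
# (for example, enableStreamingEngine).
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                      # placeholder
    region="europe-west4",                     # a Streaming Engine region from this note
    temp_location="gs://my-bucket/temp",       # placeholder
    streaming=True,
    enable_streaming_engine=True,              # run shuffle and state in the service backend
    autoscaling_algorithm="THROUGHPUT_BASED",  # streaming autoscaling
    max_num_workers=20,                        # autoscaling ceiling
)
```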

April 08, 2019

The Apache Beam SDK for Python can only use BigQuery resources in the following regions:

  • Regional locations: us-west2, us-east4, europe-north1, europe-west2, europe-west6.
  • Multi-regional locations: EU and US.

Cloud Dataflow provides beta support for Flexible Resource Scheduling (FlexRS) in the us-central1 and europe-west1 regions.

April 01, 2019

Cloud Dataflow provides beta support for VPC Service Controls.

March 24, 2019

The following SDK versions will be decommissioned later in 2019 due to the discontinuation of support for JSON-RPC and Global HTTP Batch Endpoints. Note that this change overrides the release note from December 17, 2018, which stated that decommissioning was expected to happen in March 2019.

  • Apache Beam SDK for Java, versions 2.0.0 to 2.4.0 (inclusive)
  • Apache Beam SDK for Python, versions 2.0.0 to 2.4.0 (inclusive)
  • Cloud Dataflow SDK for Java, versions 2.0.0 to 2.4.0 (inclusive)
  • Cloud Dataflow SDK for Python, 2.0.0 to 2.4.0 (inclusive)

See the SDK version support status page for detailed SDK support status.

March 20, 2019

Apache Beam SDK 2.4.0 and Cloud Dataflow SDK 2.4.0 are now deprecated. For detailed support status information, see the SDK version support status table.

March 11, 2019

Cloud Dataflow is now able to use workers in zones in the europe-west6 region (Zürich, Switzerland).

March 06, 2019

Apache Beam SDK 2.10.0 depends on gcsio client library version 1.9.13, which has known issues.

To work around these issues, either upgrade to Apache Beam SDK 2.11.0, or override the gcsio client library version to 1.9.16 or later.

February 25, 2019

You can now view system latency and data freshness metrics for your pipeline in the Cloud Dataflow monitoring interface.

February 20, 2019

Apache Beam SDK 2.10.0 contains fixes for the known issues disclosed on December 20, 2018 and February 4, 2019.

February 04, 2019

In a specific case, users of Apache Beam Java SDKs (2.9.0 and earlier) and Cloud Dataflow Java SDKs (2.5.0 and earlier) might experience data duplication when reading files from Cloud Storage. Duplication might occur when all of the following conditions are true:

  • You are reading files with the content-encoding set to gzip, and the files undergo decompressive transcoding by Cloud Storage.

  • The file size (decompressed) is larger than 2.14 GB.

  • The input stream runs into an error (and is recreated) after 2.14 GB is read.

As a workaround, do not set the content-encoding header; instead, store compressed files in Cloud Storage with the proper extension (for example, .gz for gzip). For existing files, you can update the content-encoding header and file name with the gsutil tool.
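
For existing objects, an equivalent to the gsutil approach using the google-cloud-storage Python client might look like the following sketch; the bucket and object names are placeholders, and the rename step simply makes the compression visible from the extension:

```python
# Sketch: clearing the content-encoding header on an existing object with the
# google-cloud-storage client, as an alternative to the gsutil commands.
# Bucket and object names are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")

blob = bucket.get_blob("data/input.txt")  # object stored with content-encoding: gzip
blob.content_encoding = None              # stop decompressive transcoding
blob.patch()                              # persist the metadata change

# Rename so the compression is visible from the extension instead.
bucket.rename_blob(blob, "data/input.txt.gz")
```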

December 20, 2018

Streaming Engine users should not upgrade to SDK 2.9.0 due to a known issue. If you choose to use SDK 2.9.0, you must also set the enable_conscrypt_security_provider experimental flag to enable conscrypt, which has known stability issues.

December 17, 2018

The following decommission notice has been changed. For more information, see the release note for March 24, 2019.

The following SDK versions will be decommissioned on March 25, 2019 due to the discontinuation of support for JSON-RPC and Global HTTP Batch Endpoints. Shortly after this date, you will no longer be able to submit new Cloud Dataflow jobs or update running Cloud Dataflow jobs that use the decommissioned SDKs. In addition, existing streaming jobs that use these SDK versions might fail.

  • Apache Beam SDK for Java, versions 2.0.0 to 2.4.0 (inclusive)
  • Apache Beam SDK for Python, versions 2.0.0 to 2.4.0 (inclusive)
  • Cloud Dataflow SDK for Java, versions 2.0.0 to 2.4.0 (inclusive)
  • Cloud Dataflow SDK for Python, versions 2.0.0 to 2.4.0 (inclusive)

See the SDK version support status page for detailed SDK support status.

October 22, 2018

Cloud Dataflow is now able to use workers in zones in the asia-east2 region (Hong Kong).

October 16, 2018

Cloud Dataflow SDK 1.x for Java is unsupported as of October 16, 2018. In the near future, the Cloud Dataflow service will reject new Cloud Dataflow jobs that are based on Cloud Dataflow SDK 1.x for Java. See Migrating from Cloud Dataflow SDK 1.x for Java for migration guidance.

October 03, 2018

Cloud Dataflow now has a Public IP parameter that allows you to turn off public IP addresses for your worker nodes.
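
A minimal sketch of turning off public worker IPs from the Apache Beam Python SDK; the option names and subnetwork path are assumptions and placeholders, and workers without public IPs need private access to Google APIs:

```python
# Sketch: turning off public IP addresses for Dataflow worker VMs from the
# Apache Beam Python SDK. The option names and subnetwork path are assumptions
# and placeholders; workers without public IPs need private access to
# Google APIs (for example, Private Google Access on the subnetwork).
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    use_public_ips=False,  # the Public IP parameter from this note
    subnetwork="regions/us-central1/subnetworks/my-subnet",
)
```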

July 16, 2018

Cloud Dataflow Shuffle is now generally available.

July 10, 2018

Cloud Dataflow is now able to use workers in zones in the us-west2 region (Los Angeles).

June 14, 2018

Streaming Engine is now publicly available in beta. Streaming Engine moves streaming pipeline execution out of the worker VMs and into the Cloud Dataflow service backend.

June 11, 2018

You can now specify a user-managed controller service account when you run your pipeline job.
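
For example (a sketch; the service account email is a placeholder and the option name follows the Apache Beam Python SDK):

```python
# Sketch: running a job with a user-managed controller service account.
# The email below is a placeholder; the account needs the Dataflow Worker role
# and access to the job's sources and sinks.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    service_account_email="my-worker-sa@my-project.iam.gserviceaccount.com",
)
```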

Cloud Dataflow is now able to use workers in zones in the europe-north1 region (Finland).

April 26, 2018

You can now view side input metrics for your pipeline from the Cloud Dataflow monitoring interface.

February 21, 2018

Cloud Dataflow now supports the following regional endpoints in GA: us-central1, us-east1, europe-west1, asia-east1, and asia-northeast1.

January 10, 2018

Cloud Dataflow is now able to use workers in zones in the northamerica-northeast1 region (Montréal).

Cloud Dataflow is now able to use workers in zones in the europe-west4 region (Netherlands).

October 31, 2017

Cloud Dataflow is now able to use workers in zones in the asia-south1 region (Mumbai).

October 30, 2017

Cloud Dataflow Shuffle is now available in the europe-west1 region.

Cloud Dataflow Shuffle is now available for pipelines using the Apache Beam SDK for Python version 2.1 or later.

October 25, 2017

October 12, 2017

Fixed the known issue disclosed on October 2, 2017.

October 02, 2017

Known issue: In Cloud Dataflow 2.x pipelines, when the output of a PTransform is consumed by a Flatten and by at least one other PTransform, the resulting graph is malformed, leaving the other PTransforms without input.
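
To make the affected shape concrete, the issue arises when one PCollection is consumed both by a Flatten and by another transform, as in this illustrative sketch (not taken from an affected pipeline; transform names are placeholders):

```python
# Illustrative only: the pipeline shape affected by the issue. One PCollection
# ("words") is consumed both by a Flatten and by another PTransform.
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "Words" >> beam.Create(["a", "b", "c"])
    extra = p | "Extra" >> beam.Create(["d", "e"])

    merged = (words, extra) | "Merge" >> beam.Flatten()  # consumer 1: Flatten
    lengths = words | "Lengths" >> beam.Map(len)         # consumer 2: another PTransform
```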

September 20, 2017

Cloud Dataflow provides beta support for regional endpoints us-central1 and europe-west1.

September 05, 2017

Cloud Dataflow is now able to use workers in zones in the southamerica-east1 region (São Paulo).

August 01, 2017

Cloud Dataflow is now able to use workers in zones in the europe-west3 region (Frankfurt).

July 20, 2017

You can now access the Stackdriver error report for your pipeline directly from the Dataflow monitoring interface.

June 20, 2017

Cloud Dataflow is now able to use workers in zones in the australia-southeast1 region (Sydney).

June 06, 2017

Cloud Dataflow is now able to use workers in zones in the europe-west2 region (London).

April 25, 2017

Per-step worker logs are now accessible directly in the Cloud Dataflow UI. Consult the documentation for more information.

April 11, 2017

The Cloud Dataflow service will now automatically shut down a streaming job if all steps have reached the maximum watermark. This affects only pipelines in which every source produces bounded input; for example, streaming pipelines reading from Cloud Pub/Sub are not affected.

April 03, 2017

Improved graph layout in the Cloud Dataflow UI.

September 29, 2016

Autoscaling for streaming pipelines is now publicly available in beta for use with select sources and sinks. See the autoscaling documentation for more details.

September 15, 2016

The default autoscaling ceiling for batch pipelines using the Cloud Dataflow SDK for Java 1.6 or newer has been raised to 10 worker VMs. You can specify an alternate ceiling using the --maxNumWorkers pipeline option. See the autoscaling documentation for more details.

August 18, 2016

Autoscaling for batch pipelines using the Cloud Dataflow SDK for Java 1.6 or higher is now enabled by default. This change will be rolled out to projects over the next several days. By default, the Cloud Dataflow service caps the dynamic number of workers at a ceiling of 5 worker VMs. The default autoscaling ceiling may be raised in future service releases. You can specify an alternate ceiling using the --maxNumWorkers pipeline option. See the autoscaling documentation for more details.

July 27, 2016

Announced beta support for the 0.4.0 release of the Cloud Dataflow SDK for Python. Get started and run your pipeline remotely on the service.

Default disk size for pipelines in streaming mode is now 420 GB. This change will be rolled out to projects over the next several days.

March 14, 2016

Scalability and performance improvements available when using Cloud Dataflow SDK for Java version 1.5.0:

  • The service now scales to tens of thousands of initial splits when reading from a BoundedSource. This includes TextIO.Read, AvroIO.Read, and BigtableIO.Read, among others.
  • The service will now use Avro instead of JSON as a BigQuery export format for BigQueryIO.Read. This change greatly increases the efficiency and performance when reading from BigQuery.

January 29, 2016

Changes to the runtime environment for streaming jobs:

  • Files uploaded with --filesToStage were previously downloaded to /dataflow/packages on the workers. With the latest service release, files are now located in /var/opt/google/dataflow. This change was a cleanup intended to better follow standard Linux path conventions.

January 19, 2016

Changes to the runtime environment for batch jobs:

  • Files uploaded with --filesToStage were previously downloaded to /dataflow/packages on the workers. With the latest service release, files are now located in /var/opt/google/dataflow. This change was a cleanup intended to better follow standard Linux path conventions.

November 13, 2015

Usability improvements in the Monitoring UI:

  • The Job Log tab has been renamed Logs.
  • The View Log button has moved into the Logs tab and has been renamed Worker Logs.

Performance and stability improvements for Streaming pipelines:

  • Addressed a condition that caused slowly growing memory usage in streaming workers.
  • Large Window buffers no longer need to fit entirely in memory at once.
  • Improved disk assignment to avoid data locality hotspots.
  • Worker logging is now optimized to avoid filling up the local disk.

August 12, 2015

The Cloud Dataflow Service is now generally available.

August 06, 2015

Monitoring changes:

  • Added JOB_STATE_CANCELLED as a possible state value for Cloud Dataflow jobs in the Monitoring UI and command-line interface. This state appears when a user cancels a job.
  • Temporarily, as part of the above job state introduction, jobs may show different job states in list view relative to the single job view.
  • Added a Compute Engine core-hour count field to the monitoring UI and enabled core-hour counting for bounded jobs (the field is populated with "-" for unbounded jobs).

Performance improvements to the unbounded runner.

July 28, 2015

Added a check during job creation to ensure active job names are unique within each project. You may no longer create a new job with the same name as an active job. If there are already active jobs with the same name running in the system, they will not be impacted by this change.

April 23, 2015

Improvements to the monitoring UI. Clicking View Log for a stage now defaults to display the logs generated by user code on the worker machines.

April 16, 2015

The Cloud Dataflow Service is now in beta.

Improvements to the monitoring UI: The job details page now provides more job information, including job duration and job type. For streaming pipelines, it additionally provides the data watermark.

April 13, 2015

A command-line interface for Cloud Dataflow is now available in gcloud alpha.

Default disk size in batch is 250 GB.

April 09, 2015

Improvements to the monitoring UI: Improved organization of pipeline visualization.

Default VM for batch jobs is now n1-standard-1.

Improved resource teardown operations on job completion and cancellations.

Performance improvements for the service.

April 03, 2015

Improvements to the monitoring UI: The list of jobs now includes name, type, start time, and job ID.

March 27, 2015

Improved mechanisms for elastic scaling of compute resources. Batch pipelines can now grow and shrink the worker pool size at different stages of execution.

March 20, 2015

Monitoring changes:

  • The jobs summary page now shows the status of the current job.
  • Performance improvements to the UI.

March 06, 2015

Workers now use the Java 8 runtime.

March 01, 2015

Dynamic work rebalancing is now available.

Streaming support is enabled for all projects participating in the alpha.
