Release Notes: Cloud Dataflow SDK 2.x for Java

Cloud Dataflow SDK Deprecation Notice: The Cloud Dataflow SDK 2.5.0 is the last Cloud Dataflow SDK release that is separate from the Apache Beam SDK releases. The Cloud Dataflow service fully supports official Apache Beam SDK releases. The Cloud Dataflow service also supports previously released Apache Beam SDKs, version 2.0.0 and later. See the Cloud Dataflow support page for the support status of various SDKs. The Apache Beam downloads page contains release notes for the Apache Beam SDK releases.

NOTE: Cloud Dataflow SDK 2.x releases have a number of significant changes from the 1.x series of releases. See the changes in the 1.x to 2.x migration guide.

In January 2016, Google announced the donation of the Cloud Dataflow SDKs to the Apache Software Foundation as part of the Apache Beam project. The Cloud Dataflow SDKs are now based on Apache Beam.

The Cloud Dataflow SDK for Java 2.0.0 is the first stable 2.x release of the Cloud Dataflow SDK for Java, based on a subset of the Apache Beam code base. For information about version 1.x of the Cloud Dataflow SDK for Java, see the Cloud Dataflow SDK 1.x for Java release notes.

The support page contains information about the support status of each release of the Cloud Dataflow SDK.

To install and use the Cloud Dataflow SDK, see the Cloud Dataflow SDK installation guide.

Warning for users upgrading from Cloud Dataflow SDK for Java 1.x:
This is a new major version, and therefore comes with the following caveats.
* Breaking Changes: The Cloud Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases. Please see below for details.
* Update Incompatibility: The Cloud Dataflow SDK 2.x for Java is update-incompatible with Cloud Dataflow 1.x. Streaming jobs using a Cloud Dataflow 1.x SDK cannot be updated to use a Cloud Dataflow 2.x SDK. Cloud Dataflow 2.x pipelines may only be updated across versions starting with SDK version 2.0.0.

Cloud Dataflow SDK distribution contents

The Cloud Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Cloud Dataflow service, such as:

  • The core SDK
  • DirectRunner and DataflowRunner
  • I/O components for other Google Cloud Platform services
  • An I/O component for Apache Kafka

The Cloud Dataflow SDK distribution does not include other Beam components, such as:

  • Runners for other distributed processing engines (such as Apache Spark or Apache Flink)
  • I/O components for non-Cloud Platform services that are not explicitly listed above

If your use case requires any components that are not included, you can use the appropriate Apache Beam modules directly and still run your pipeline on the Cloud Dataflow service.

Release notes

This section provides each version's most relevant changes for Cloud Dataflow customers.

The Apache Beam downloads page contains release notes for the Apache Beam SDK releases.

2.5.0 (June 27, 2018)

FUTURE DEPRECATION NOTICE: This Cloud Dataflow SDK 2.5.0 release is the last Cloud Dataflow SDK release for Java that is separate from the Apache Beam SDK releases.

Version 2.5.0 is based on a subset of Apache Beam 2.5.0. See the Apache Beam 2.5.0 release announcement for additional change information.

Added KafkaIO as a part of the Cloud Dataflow SDK.
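
For example, a streaming pipeline might consume a Kafka topic as follows. This is a minimal sketch: the broker address, topic name, and the pipeline variable are placeholders, and the Kafka client library must be on the classpath.

    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.kafka.common.serialization.LongDeserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    // Read (key, value) records from a Kafka topic, dropping Kafka metadata.
    PCollection<KV<Long, String>> records =
        pipeline.apply(
            KafkaIO.<Long, String>read()
                .withBootstrapServers("broker-1:9092")
                .withTopic("my-topic")
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata());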

Enabled BigQueryIO to use a different project for load jobs in batch mode.

Added the ability to disable the batch API in SpannerIO.
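
For example, a minimal sketch (the instance, database, and query values are placeholders; batching remains enabled by default):

    import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;

    // Read from Cloud Spanner with the batch API turned off.
    pipeline.apply(
        SpannerIO.read()
            .withInstanceId("my-instance")
            .withDatabaseId("my-database")
            .withQuery("SELECT id, name FROM users")
            .withBatching(false));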

Added an option to BigQueryIO for indicating the query priority: INTERACTIVE or BATCH.
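
A minimal sketch of the new option (the query text is illustrative):

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.QueryPriority;

    // Run the export query at BATCH priority instead of the default INTERACTIVE.
    pipeline.apply(
        BigQueryIO.readTableRows()
            .fromQuery("SELECT word, word_count FROM `my-project.my_dataset.word_counts`")
            .usingStandardSql()
            .withQueryPriority(QueryPriority.BATCH));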

Fixed an issue that caused the DirectRunner to hang if multiple timers were set in the same bundle.

Addressed several stability, performance, and documentation issues.

2.4.0 (March 30, 2018)

Version 2.4.0 is based on a subset of Apache Beam 2.4.0. See the Apache Beam 2.4.0 release announcement for additional change information.

Added new Wait.on() transform which allows general-purpose sequencing of transforms. Wait.on() is usable with both batch and streaming pipelines.
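
For example, to run one write step only after another has completed. In this sketch the two composite write transforms and the input collections are hypothetical; Wait.on only assumes that the signal collection is complete (per window) before the waiting elements are emitted.

    import org.apache.beam.sdk.transforms.Wait;
    import org.apache.beam.sdk.values.PCollection;

    // firstWriteDone signals completion of the first write.
    PCollection<Void> firstWriteDone = input.apply("FirstWrite", new FirstWriteTransform());

    // Hold back rows until firstWriteDone is complete, then perform the second write.
    rows.apply(Wait.on(firstWriteDone))
        .apply("SecondWrite", new SecondWriteTransform());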

Increased scalability of watching for new files with I/O transforms such as TextIO.read().watchForNewFiles(). Tests successfully read up to 1 million files.
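
For example, the following sketch polls a Cloud Storage prefix for new files every 30 seconds and stops watching one hour after the last new file appears (the file pattern is a placeholder):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Watch;
    import org.joda.time.Duration;

    pipeline.apply(
        TextIO.read()
            .from("gs://my-bucket/logs/*.log")
            .watchForNewFiles(
                // Poll for new files every 30 seconds ...
                Duration.standardSeconds(30),
                // ... and stop one hour after the last new file is seen.
                Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))));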

BigQueryIO.write() now supports column-based partitioning, which enables cheaper and faster loading of historical data into a time-partitioned table. Column-based partitioning uses one load job total, instead of one load job per partition.
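
A minimal sketch of a column-partitioned load (the table name, schema variable, and partitioning field are placeholders):

    import com.google.api.services.bigquery.model.TimePartitioning;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    // Load rows into a table partitioned by the event_timestamp column.
    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.events")
            .withSchema(schema)
            .withTimePartitioning(
                new TimePartitioning().setType("DAY").setField("event_timestamp")));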

Updated SpannerIO to use Cloud Spanner's BatchQuery API. The Cloud Spanner documentation contains more information about using the Cloud Dataflow Connector.

Updated Cloud Dataflow containers to support the Java Cryptography Extension (JCE) unlimited policy.

Removed deprecated method WindowedValue.valueInEmptyWindows().

Fixed a premature data discarding issue with ApproximateUnique.

Addressed several stability, performance, and documentation issues.

2.3.0 (February 28, 2018)

Version 2.3.0 is based on a subset of Apache Beam 2.3.0. See the Apache Beam 2.3.0 release announcement for additional change information.

Dropped support for Java 7.

Added template support (ValueProvider parameters) to BigtableIO.

Added new general-purpose fluent API for writing to files: FileIO.write() and FileIO.writeDynamic(). DynamicDestinations APIs in TextIO and AvroIO are now deprecated.
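
For example, a sketch that writes each event type to its own group of files (the Event class and its accessors are hypothetical; the output location is a placeholder):

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Contextful;

    // Group elements by destination (the event type), format each element as a
    // line of text, and name each group's files after its type.
    events.apply(
        FileIO.<String, Event>writeDynamic()
            .by((Event event) -> event.getType())
            .via(Contextful.fn((Event event) -> event.toString()), TextIO.sink())
            .to("gs://my-bucket/events/")
            .withDestinationCoder(StringUtf8Coder.of())
            .withNaming(type -> FileIO.Write.defaultNaming("events-" + type, ".txt")));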

MapElements and FlatMapElements now support side inputs.
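
For example, a sketch that compares each word against a threshold supplied as a side input (the collection names are illustrative):

    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Requirements;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;
    import org.apache.beam.sdk.values.TypeDescriptors;

    PCollectionView<Integer> threshold = thresholds.apply(View.asSingleton());

    PCollection<String> tagged =
        words.apply(
            MapElements.into(TypeDescriptors.strings())
                .via(
                    Contextful.fn(
                        // The side input is read through the Context argument.
                        (String word, Contextful.Fn.Context c) ->
                            word.length() > c.sideInput(threshold) ? word.toUpperCase() : word,
                        Requirements.requiresSideInputs(threshold))));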

Fixed issues in BigQueryIO when writing to tables with partition decorators.

2.2.0 (December 8, 2017)

Version 2.2.0 is based on a subset of Apache Beam 2.2.0. See the Apache Beam 2.2.0 release notes for additional change information.

Issues

Known issue: When running in batch mode, Gauge metrics are not reported.

Known issue: SQL support is not included in this release because it is experimental and not tested on Cloud Dataflow. Using SQL is not recommended.

Updates and improvements

Added the ability to set job labels in DataflowPipelineOptions.
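
A minimal sketch, assuming the labels property on DataflowPipelineOptions (the label keys and values are placeholders):

    import com.google.common.collect.ImmutableMap;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // The labels appear on the resulting Dataflow job.
    options.setLabels(ImmutableMap.of("team", "analytics", "environment", "test"));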

Added support for reading Avro GenericRecords with BigQueryIO.
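
For example, a sketch that extracts a single field from each Avro record returned by a query (the query and field name are placeholders):

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;

    // The parse function receives each row as an Avro GenericRecord.
    pipeline.apply(
        BigQueryIO.read((SchemaAndRecord row) -> row.getRecord().get("word").toString())
            .fromQuery("SELECT word FROM `my-project.my_dataset.words`")
            .usingStandardSql()
            .withCoder(StringUtf8Coder.of()));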

Added support for multi-byte custom separators to TextIO.
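
For example, a sketch that splits records on a two-byte CRLF separator (the file pattern is a placeholder):

    import org.apache.beam.sdk.io.TextIO;

    pipeline.apply(
        TextIO.read()
            .from("gs://my-bucket/exports/*.txt")
            .withDelimiter(new byte[] {'\r', '\n'}));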

Added support for watching new files (TextIO.watchForNewFiles) to TextIO.

Added support for reading a PCollection of filenames (ReadAll) to TextIO and AvroIO.
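
For example (the file names are placeholders):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;

    // Expand a collection of file patterns into the lines of the matched files.
    PCollection<String> lines =
        pipeline
            .apply(Create.of("gs://my-bucket/part-1.txt", "gs://my-bucket/part-2.txt"))
            .apply(TextIO.readAll());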

Added support for disabling validation in BigtableIO.write().

Added support for using the setTimePartitioning method in BigQueryIO.write().

Addressed several stability, performance, and documentation issues.

2.1.0 (September 1, 2017)

Version 2.1.0 is based on a subset of Apache Beam 2.1.0. See the Apache Beam 2.1.0 release notes for additional change information.

Issues

Known issue: When running in batch mode, Gauge metrics are not reported.

Updates and improvements

Added detection for potentially stuck code, which results in "Processing lulls" log entries.

Added Metrics support for DataflowRunner in streaming mode.

Added OnTimeBehavior to WindowingStrategy to control emitting of ON_TIME panes.

Added a default file name policy for FileBasedSinks that consume windowed input.

Fixed an issue in which processing time timers for expired windows were ignored.

Fixed an issue in which DatastoreIO failed to make progress when Datastore was slow to respond.

Fixed an issue in which bzip2 files were being partially read; added support for concatenated bzip2 files.

Addressed several stability, performance, and documentation issues.

2.0.0 (May 23, 2017)

Version 2.0.0 is based on a subset of Apache Beam 2.0.0. See the Apache Beam 2.0.0 release notes for additional change information.

Issues

Known issue: When running in streaming mode, metrics are not reported in the Cloud Dataflow UI.

Known issue: When running in batch mode, Gauge metrics are not reported.

Updates and improvements

Added support for using the Stackdriver Error Reporting Interface.

Added new API in BigQueryIO for writing into multiple tables, possibly with different schemas, based on data. See BigQueryIO.Write.to(SerializableFunction) and BigQueryIO.Write.to(DynamicDestinations).
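
For example, a sketch that uses the SerializableFunction form to route each element to a per-type table (the Event class, format function, and schema variable are hypothetical):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    events.apply(
        BigQueryIO.<Event>write()
            // Choose the destination table from the element itself.
            .to((ValueInSingleWindow<Event> e) ->
                new TableDestination(
                    "my-project:my_dataset.events_" + e.getValue().getType(), null))
            .withFormatFunction((Event event) -> new TableRow().set("id", event.getId()))
            .withSchema(schema));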

Added new API for writing windowed and unbounded collections to TextIO and AvroIO. For example, see TextIO.Write.withWindowedWrites() and TextIO.Write.withFilenamePolicy(FilenamePolicy).

Added TFRecordIO to read and write TensorFlow TFRecord files.
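
For example (the file patterns are placeholders; each record is a raw byte array):

    import org.apache.beam.sdk.io.TFRecordIO;
    import org.apache.beam.sdk.values.PCollection;

    // Read TFRecord files into byte[] records, then write them back out.
    PCollection<byte[]> records =
        pipeline.apply(TFRecordIO.read().from("gs://my-bucket/training/*.tfrecord"));
    records.apply(TFRecordIO.write().to("gs://my-bucket/output/data"));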

Added the ability to automatically register CoderProviders in the default CoderRegistry. CoderProviders are registered by a ServiceLoader via concrete implementations of a CoderProviderRegistrar.
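
A sketch of such a registrar. MyType and MyTypeCoder are hypothetical; the AutoService annotation generates the ServiceLoader metadata that makes the registrar discoverable.

    import com.google.auto.service.AutoService;
    import java.util.Collections;
    import java.util.List;
    import org.apache.beam.sdk.coders.CoderProvider;
    import org.apache.beam.sdk.coders.CoderProviderRegistrar;
    import org.apache.beam.sdk.coders.CoderProviders;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Discovered automatically at runtime through ServiceLoader.
    @AutoService(CoderProviderRegistrar.class)
    public class MyTypeCoderProviderRegistrar implements CoderProviderRegistrar {
      @Override
      public List<CoderProvider> getCoderProviders() {
        return Collections.singletonList(
            CoderProviders.forCoder(TypeDescriptor.of(MyType.class), MyTypeCoder.of()));
      }
    }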

Added an additional Google API requirement. You must now also enable the Cloud Resource Manager API.

Changed order of parameters for ParDo with side inputs and outputs.

Changed order of parameters for MapElements and FlatMapElements transforms when specifying an output type.

Changed the pattern for reading and writing custom types to PubsubIO and KafkaIO.

Changed the syntax for reading from and writing to TextIO, AvroIO, TFRecordIO, KinesisIO, and BigQueryIO.

Changed syntax for configuring windowing parameters other than the WindowFn itself using the Window transform.

Consolidated XmlSource and XmlSink into XmlIO.

Renamed CountingInput to GenerateSequence and unified the syntax for producing bounded and unbounded sequences.
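
For example:

    import org.apache.beam.sdk.io.GenerateSequence;
    import org.joda.time.Duration;

    // Bounded: the numbers 0 through 99.
    pipeline.apply(GenerateSequence.from(0).to(100));

    // Unbounded: an endless sequence, produced at roughly five elements per second.
    pipeline.apply(GenerateSequence.from(0).withRate(5, Duration.standardSeconds(1)));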

Renamed BoundedSource#splitIntoBundles to #split.

Renamed UnboundedSource#generateInitialSplits to #split.

Output from @StartBundle is no longer possible. Instead of accepting a parameter of type Context, this method may optionally accept an argument of type StartBundleContext to access PipelineOptions.

Output from @FinishBundle now always requires an explicit timestamp and window. Instead of accepting a parameter of type Context, this method may optionally accept an argument of type FinishBundleContext to access PipelineOptions and emit output to specific windows.
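
A sketch of the new signatures (the counting DoFn is illustrative, and a real pipeline would choose a meaningful timestamp and window for the final output):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
    import org.joda.time.Instant;

    class BundleSizeFn extends DoFn<String, Long> {
      private long count;

      @StartBundle
      public void startBundle(StartBundleContext context) {
        // Output is no longer allowed here; the context only exposes PipelineOptions.
        count = 0;
      }

      @ProcessElement
      public void processElement(ProcessContext context) {
        count++;
      }

      @FinishBundle
      public void finishBundle(FinishBundleContext context) {
        // Output now requires an explicit timestamp and window.
        context.output(count, Instant.now(), GlobalWindow.INSTANCE);
      }
    }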

XmlIO is no longer part of the SDK core. It must be added manually using the new xml-io package.

Replaced Aggregator API with the new Metrics API to create user-defined metrics/counters.
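
For example, a counter that replaces what would previously have been an Aggregator (the names are illustrative):

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    class ParseFn extends DoFn<String, String> {
      private final Counter malformed = Metrics.counter(ParseFn.class, "malformedRecords");

      @ProcessElement
      public void processElement(ProcessContext context) {
        if (context.element().isEmpty()) {
          malformed.inc();  // reported per step and visible in the monitoring UI
          return;
        }
        context.output(context.element());
      }
    }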


Note: All 2.0.0-beta versions are DEPRECATED.

2.0.0-beta3 (March 17, 2017)

Version 2.0.0-beta3 is based on a subset of Apache Beam 0.6.0.

Changed TextIO to only operate on strings.

Changed KafkaIO to specify type parameters explicitly.

Renamed factory functions of ToString.

Changed Count, Latest, Sample, SortValues transforms.

Renamed Write.Bound to Write.

Renamed Flatten transform classes.

Split GroupByKey.create method into create and createWithFewKeys methods.

2.0.0-beta2 (February 2, 2017)

Version 2.0.0-beta2 is based on a subset of Apache Beam 0.5.0.

Added PubsubIO functionality that allows the Read and Write transforms to provide access to Cloud Pub/Sub message attributes.

Added support for stateful pipelines via the new State API.

Added support for timers via the new Timer API. Support for this feature is limited to the DirectRunner in this release.
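
A combined sketch of the State and Timer APIs, written against the package layout of later 2.x releases (the per-key counting logic is illustrative):

    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    class CountPerKeyFn extends DoFn<KV<String, Long>, Long> {
      @StateId("count")
      private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

      @TimerId("flush")
      private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

      @ProcessElement
      public void processElement(
          ProcessContext context,
          @StateId("count") ValueState<Long> count,
          @TimerId("flush") Timer flush) {
        Long current = count.read();
        count.write(current == null ? 1 : current + 1);
        // Emit the running count one minute after the last element for this key.
        flush.offset(Duration.standardMinutes(1)).setRelative();
      }

      @OnTimer("flush")
      public void onFlush(OnTimerContext context, @StateId("count") ValueState<Long> count) {
        Long current = count.read();
        if (current != null) {
          context.output(current);
          count.clear();
        }
      }
    }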

2.0.0-beta1 (January 7, 2017)

Version 2.0.0-beta1 is based on a subset of Apache Beam 0.4.0.

Improved compression: CompressedSource supports reading ZIP-compressed files. TextIO.Write and AvroIO.Write support compressed output.

Added functionality to AvroIO: Write supports the addition of custom user metadata.

Added functionality to BigQueryIO: Write splits large (> 12 TiB) bulk imports into multiple BigQuery load jobs, allowing it to handle very large datasets.

Added functionality to BigtableIO: Write supports unbounded PCollections and can be used in streaming mode in the DataflowRunner.

For complete details, please see the release notes for Apache Beam 0.3.0-incubating, 0.4.0, and 0.5.0.

Other Apache Beam modules from the corresponding version can be used with this distribution, including additional I/O connectors like Java Message Service (JMS), Apache Kafka, Java Database Connectivity (JDBC), MongoDB, and Amazon Kinesis. Please see the Apache Beam site for details.
