NOTE: The Apache Beam downloads page contains release notes for the Apache Beam SDK releases.
NOTE: Cloud Dataflow SDK 2.x releases have a number of significant changes from the 1.x series of releases. See the changes in the 1.x to 2.x migration guide.
In January 2016, Google announced the donation of the Dataflow SDKs to the Apache Software Foundation as part of the Apache Beam project. The Dataflow SDKs are now based on Apache Beam.
The Dataflow SDK for Java 2.0.0 is the first stable 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base. For information about version 1.x of the Dataflow SDK for Java, see the Dataflow SDK 1.x for Java release notes.
The SDK version support status page contains information about the support status of each release of the Dataflow SDK.
To install and use the Dataflow SDK, see the Dataflow SDK installation guide.
You can see the latest product updates for all of Google Cloud on the Google Cloud release notes page.
Warning for users upgrading from Dataflow SDK for Java 1.x:
This is a new major version, and therefore comes with the following caveats.
* Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases. See below for details.
* Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Dataflow 2.x pipelines may only be updated across versions starting with SDK version 2.0.0.
Cloud Dataflow SDK distribution contents
The Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Dataflow service, such as:
- The core SDK
- DirectRunner and DataflowRunner
- I/O components for other Google Cloud Platform services
- Apache Kafka
The Dataflow SDK distribution does not include other Beam components, such as:
- Runners for other distributed processing engines (such as Apache Spark or Apache Flink)
- I/O components for non-Cloud Platform services that are not explicitly listed above
If your use case requires any components that are not included, you can use the appropriate Apache Beam modules directly and still run your pipeline on the Dataflow service.
Release notes
This section provides each version's most relevant changes for Dataflow customers.
March 24, 2019
The following SDK versions will be decommissioned later in 2019 due to the discontinuation of support for JSON-RPC and Global HTTP Batch Endpoints. This change overrides the release note from December 17, 2018, which stated that decommissioning was expected to happen in March 2019.
- Apache Beam SDK for Java, versions 2.0.0 to 2.4.0 (inclusive)
- Apache Beam SDK for Python, versions 2.0.0 to 2.4.0 (inclusive)
- Dataflow SDK for Java, versions 2.0.0 to 2.4.0 (inclusive)
- Dataflow SDK for Python, versions 2.0.0 to 2.4.0 (inclusive)
See the SDK version support status page for detailed SDK support status.
December 17, 2018
The following SDK versions will be decommissioned on March 25, 2019 due to the discontinuation of support for JSON-RPC and Global HTTP Batch Endpoints. Shortly after this date, you will no longer be able to submit new Dataflow jobs or update running Dataflow jobs that use the decommissioned SDKs. In addition, existing streaming jobs that use these SDK versions might fail.
- Apache Beam SDK for Java, versions 2.0.0 to 2.4.0 (inclusive)
- Apache Beam SDK for Python, versions 2.0.0 to 2.4.0 (inclusive)
- Dataflow SDK for Java, versions 2.0.0 to 2.4.0 (inclusive)
- Dataflow SDK for Python, versions 2.0.0 to 2.4.0 (inclusive)
See the SDK version support status page for detailed SDK support status.
2.5.0 (June 27, 2018)
FUTURE DEPRECATION NOTICE: This Dataflow SDK 2.5.0 release is the last Dataflow SDK release for Java that is separate from the Apache Beam SDK releases.
Version 2.5.0 is based on a subset of Apache Beam 2.5.0. See the Apache Beam 2.5.0 release announcement for additional change information.
- Added KafkaIO as a part of the Dataflow SDK.
- Enabled BigQueryIO to use a different project for load jobs in batch mode.
- Added the ability to disable the batch API in SpannerIO.
- Added an option to provide BigQueryIO with a method to indicate query priority: INTERACTIVE or BATCH.
- Fixed an issue in which DirectRunner would hang if multiple timers were set in the same bundle.
- Made several stability, performance, and documentation improvements.
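The KafkaIO addition above can be sketched as a minimal read-side pipeline. This is a sketch, not a definitive usage pattern; the broker address and topic name are hypothetical placeholders, and the deserializer classes come from the Kafka client library.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read key/value records from Kafka; broker and topic are placeholders.
    PCollection<KV<Long, String>> records =
        p.apply(KafkaIO.<Long, String>read()
            .withBootstrapServers("broker-1:9092")        // hypothetical broker
            .withTopic("my-topic")                        // hypothetical topic
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());                          // keep only the KV pairs

    p.run();
  }
}
```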
2.4.0 (March 30, 2018)
Version 2.4.0 is based on a subset of Apache Beam 2.4.0. See the Apache Beam 2.4.0 release announcement for additional change information.
- Added the new Wait.on() transform, which allows general-purpose sequencing of transforms. Wait.on() is usable with both batch and streaming pipelines.
- Increased the scalability of watching for new files with I/O transforms such as TextIO.read().watchForNewFiles(). Tests successfully read up to 1 million files.
- BigQueryIO.write() now supports column-based partitioning, which enables cheaper and faster loading of historical data into a time-partitioned table. Column-based partitioning uses one load job in total, instead of one load job per partition.
- Updated SpannerIO to use Cloud Spanner's BatchQuery API. The Spanner documentation contains more information about using the Dataflow Connector.
- Updated Dataflow containers to support the Java Cryptography Extension (JCE) unlimited policy.
- Removed the deprecated method WindowedValue.valueInEmptyWindows().
- Fixed a premature data-discarding issue with ApproximateUnique.
- Made several stability, performance, and documentation improvements.
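The Wait.on() transform described above gates one collection on the completion of another. A minimal sketch, with illustrative names (the helper method is not part of the SDK):

```java
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;

public class WaitOnSketch {
  // Gate `input` on completion of `signal`: downstream transforms see the
  // elements of `input` only once the corresponding window of `signal`
  // has finished processing (e.g. after a preceding write step completes).
  static <T> PCollection<T> afterSignal(PCollection<T> input, PCollection<?> signal) {
    return input.apply("WaitOnSignal", Wait.on(signal));
  }
}
```

This works in both batch and streaming pipelines because the gating is per-window rather than whole-pipeline.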
2.3.0 (February 28, 2018)
Version 2.3.0 is based on a subset of Apache Beam 2.3.0. See the Apache Beam 2.3.0 release announcement for additional change information.
- Dropped support for Java 7.
- Added template support (ValueProvider parameters) to BigtableIO.
- Added a new general-purpose fluent API for writing to files: FileIO.write() and FileIO.writeDynamic(). The DynamicDestinations APIs in TextIO and AvroIO are now deprecated.
- MapElements and FlatMapElements now support side inputs.
- Fixed issues in BigQueryIO when writing to tables with partition decorators.
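The fluent FileIO API mentioned above can be sketched as follows; the output location and suffix are illustrative placeholders.

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

public class FileWriteSketch {
  // Write a PCollection of lines using the fluent FileIO API.
  static void writeLines(PCollection<String> lines) {
    lines.apply(FileIO.<String>write()
        .via(TextIO.sink())             // encode each element as a line of text
        .to("gs://my-bucket/output/")   // hypothetical output location
        .withSuffix(".txt"));
  }
}
```

FileIO.writeDynamic() follows the same fluent style but additionally takes a destination function, which replaces the now-deprecated DynamicDestinations APIs in TextIO and AvroIO.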
2.2.0 (December 8, 2017)
Version 2.2.0 is based on a subset of Apache Beam 2.2.0. See the Apache Beam 2.2.0 release notes for additional change information.
Issues
Known issue: When running in batch mode, Gauge metrics are not reported.
Known issue: SQL support is not included in this release because it is experimental and not tested on Dataflow. Using SQL is not recommended.
Updates and improvements
- Added the ability to set job labels in DataflowPipelineOptions.
- Added support for reading Avro GenericRecords with BigQueryIO.
- Added support for multi-byte custom separators to TextIO.
- Added support for watching for new files (TextIO.watchForNewFiles) to TextIO.
- Added support for reading a PCollection of filenames (ReadAll) to TextIO and AvroIO.
- Added support for disabling validation in BigtableIO.write().
- Added support for using the setTimePartitioning method in BigQueryIO.write().
- Made several stability, performance, and documentation improvements.
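The file-watching support listed above can be sketched like this; the input pattern, poll interval, and termination condition are illustrative choices.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WatchFilesSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Continuously poll for files matching the pattern; stop watching
    // once no new files have appeared for an hour.
    PCollection<String> lines = p.apply(
        TextIO.read()
            .from("gs://my-bucket/incoming/*.txt")  // hypothetical input pattern
            .watchForNewFiles(
                Duration.standardMinutes(1),        // poll interval
                Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))));

    p.run();
  }
}
```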
2.1.0 (September 1, 2017)
Version 2.1.0 is based on a subset of Apache Beam 2.1.0. See the Apache Beam 2.1.0 release notes for additional change information.
Issues
Known issue: When running in batch mode, Gauge metrics are not reported.
Updates and improvements
- Added detection of potentially stuck code, which results in Processing lulls log entries.
- Added Metrics support for DataflowRunner in streaming mode.
- Added OnTimeBehavior to WindowingStrategy to control the emitting of ON_TIME panes.
- Added a default file name policy for windowed FileBasedSinks that consume windowed input.
- Fixed an issue in which processing time timers for expired windows were ignored.
- Fixed an issue in which DatastoreIO failed to make progress when Datastore was slow to respond.
- Fixed an issue in which bzip2 files were being partially read; added support for concatenated bzip2 files.
- Made several stability, performance, and documentation improvements.
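The Metrics support noted above is used from inside a DoFn; a minimal counter sketch (the namespace and counter name are illustrative):

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class CountingFn extends DoFn<String, String> {
  // A user-defined counter; it appears under the given namespace and name
  // in the runner's reported metrics.
  private final Counter processed = Metrics.counter(CountingFn.class, "elements-processed");

  @ProcessElement
  public void processElement(ProcessContext c) {
    processed.inc();          // increment once per element
    c.output(c.element());
  }
}
```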
2.0.0 (May 23, 2017)
Version 2.0.0 is based on a subset of Apache Beam 2.0.0. See the Apache Beam 2.0.0 release notes for additional change information.
Issues
Known issue: When running in streaming mode, metrics are not reported in the Dataflow UI.
Known issue: When running in batch mode, Gauge metrics are not reported.
Updates and improvements
- Added support for using the Stackdriver Error Reporting interface.
- Added a new API in BigQueryIO for writing into multiple tables, possibly with different schemas, based on the data. See BigQueryIO.Write.to(SerializableFunction) and BigQueryIO.Write.to(DynamicDestinations).
- Added a new API for writing windowed and unbounded collections to TextIO and AvroIO. For example, see TextIO.Write.withWindowedWrites() and TextIO.Write.withFilenamePolicy(FilenamePolicy).
- Added TFRecordIO to read and write TensorFlow TFRecord files.
- Added the ability to automatically register CoderProviders in the default CoderRegistry. CoderProviders are registered by a ServiceLoader via concrete implementations of a CoderProviderRegistrar.
- Added an additional Google API requirement: you must now also enable the Cloud Resource Manager API.
- Changed the order of parameters for ParDo with side inputs and outputs.
- Changed the order of parameters for the MapElements and FlatMapElements transforms when specifying an output type.
- Changed the pattern for reading and writing custom types to PubsubIO and KafkaIO.
- Changed the syntax for reading from and writing to TextIO, AvroIO, TFRecordIO, KinesisIO, and BigQueryIO.
- Changed the syntax for configuring windowing parameters other than the WindowFn itself using the Window transform.
- Consolidated XmlSource and XmlSink into XmlIO.
- Renamed CountingInput to GenerateSequence and unified the syntax for producing bounded and unbounded sequences.
- Renamed BoundedSource#splitIntoBundles to #split.
- Renamed UnboundedSource#generateInitialSplits to #split.
- Output from @StartBundle is no longer possible. Instead of accepting a parameter of type Context, this method may optionally accept an argument of type StartBundleContext to access PipelineOptions.
- Output from @FinishBundle now always requires an explicit timestamp and window. Instead of accepting a parameter of type Context, this method may optionally accept an argument of type FinishBundleContext to access PipelineOptions and emit output to specific windows.
- XmlIO is no longer part of the SDK core. It must be added manually using the new xml-io package.
- Replaced the Aggregator API with the new Metrics API for creating user-defined metrics/counters.
Note: All 2.0.0-beta versions are DEPRECATED.
2.0.0-beta3 (March 17, 2017)
Version 2.0.0-beta3 is based on a subset of Apache Beam 0.6.0.
- Changed TextIO to only operate on strings.
- Changed KafkaIO to specify type parameters explicitly.
- Renamed the factory functions of ToString.
- Changed the Count, Latest, Sample, and SortValues transforms.
- Renamed Write.Bound to Write.
- Renamed the Flatten transform classes.
- Split the GroupByKey.create method into create and createWithFewKeys methods.
2.0.0-beta2 (February 2, 2017)
Version 2.0.0-beta2 is based on a subset of Apache Beam 0.5.0.
- Added PubsubIO functionality that allows the Read and Write transforms to provide access to Cloud Pub/Sub message attributes.
- Added support for stateful pipelines via the new State API.
- Added support for timers via the new Timer API. Support for this feature is limited to the DirectRunner in this release.
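The State API mentioned above is declared on a DoFn; a minimal per-key counting sketch follows. Package names here follow the later stable SDK layout, and the state id is an illustrative choice.

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class PerKeyCountFn extends DoFn<KV<String, String>, KV<String, Long>> {
  // Declares a piece of per-key-and-window state holding a running count.
  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      ProcessContext c, @StateId("count") ValueState<Long> count) {
    Long current = count.read();
    long next = (current == null ? 0L : current) + 1;
    count.write(next);
    c.output(KV.of(c.element().getKey(), next));
  }
}
```

The Timer API is declared analogously with @TimerId and a TimerSpec, and, as noted above, was limited to the DirectRunner in this release.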
2.0.0-beta1 (January 7, 2017)
Version 2.0.0-beta1 is based on a subset of Apache Beam 0.4.0.
- Improved compression: CompressedSource supports reading ZIP-compressed files, and TextIO.Write and AvroIO.Write support compressed output.
- Added functionality to AvroIO: Write supports the addition of custom user metadata.
- Added functionality to BigQueryIO: Write splits large (> 12 TiB) bulk imports into multiple BigQuery load jobs, allowing it to handle very large datasets.
- Added functionality to BigtableIO: Write supports unbounded PCollections and can be used in streaming mode in the DataflowRunner.
For complete details, please see the release notes for Apache Beam 0.3.0-incubating, 0.4.0, and 0.5.0.
Other Apache Beam modules from the corresponding version can be used with this distribution, including additional I/O connectors like Java Message Service (JMS), Apache Kafka, Java Database Connectivity (JDBC), MongoDB, and Amazon Kinesis. Please see the Apache Beam site for details.