When you create a pipeline, you'll often need to read data from some external source, such as a file in Google Cloud Storage or a BigQuery table. Likewise, you'll want your pipeline to output its result data to a similar external data sink, such as output files in Cloud Storage or BigQuery. The Dataflow SDKs provide transforms that can read data from an external source or write data to an external sink.
The Dataflow SDKs provide Read and Write transforms for a number of common data storage types. In addition, the Read and Write APIs are extensible; if you want your pipeline to read from or write to a data storage format that isn't supported by the built-in transforms, you can build extensions to provide your own read and write operations.
Reading Input Data
Read transforms read data from an external source and return a
PCollection representation of the data for use by your pipeline.
You can use a Read transform at any point while constructing your pipeline to create a new PCollection, though it will be most common at the start of your program.
Read transforms do not have input PCollections; instead, they are applied to the Pipeline directly. As usual, the call to apply returns a PCollection of the appropriate type, whose elements represent the data. See Constructing Your Pipeline for more information.
Reading From Multiple Locations
Some Read transforms, such as TextIO.Read, support reading from multiple input files that match a glob operator you provide. Consider the following use of the Read transform, which uses a glob operator (*) to read all matching input files in a given location in Google Cloud Storage:
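A minimal sketch of such a read, assuming the Dataflow SDK for Java; the bucket path and transform name are illustrative placeholders, not values from this document:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Illustrative sketch: the bucket path and transform name are assumptions.
// The glob (*) matches every object with prefix "input-" and suffix ".csv".
Pipeline p = Pipeline.create(options);  // options constructed elsewhere
PCollection<String> lines = p.apply(
    TextIO.Read.named("ReadFromText")
        .from("gs://my_bucket/path/to/input-*.csv"));
```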
This Read will read all files at the given location in Cloud Storage with the prefix "input-" and the suffix ".csv".
To read data from disparate sources into a single PCollection, read each one independently and then use the Flatten transform to create a single PCollection.
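A sketch of that pattern, again assuming the Dataflow SDK for Java; the source paths are hypothetical:

```java
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

// Hypothetical paths: read two sources independently...
PCollection<String> sourceA = p.apply(TextIO.Read.from("gs://my_bucket/sourceA/*"));
PCollection<String> sourceB = p.apply(TextIO.Read.from("gs://my_bucket/sourceB/*"));

// ...then merge them into a single PCollection with Flatten.
PCollection<String> merged =
    PCollectionList.of(sourceA).and(sourceB)
        .apply(Flatten.<String>pCollections());
```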
Writing Output Data
Write transforms write the data in a PCollection to an external data sink. You'll most often use Write transforms at the end of your pipeline to output your pipeline's final results. However, you can use Write to output a PCollection's data at any point in your pipeline.
To use a Write transform, you call the apply method on the PCollection that you want to write, and pass the appropriate Write transform as an argument. When you apply a Write transform to a PCollection, the return value is an object of type PDone. The PDone object is a trivial result object and can be safely ignored.
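A sketch of applying a Write and receiving the PDone return value, assuming the Dataflow SDK for Java; the output path and transform name are hypothetical:

```java
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.values.PDone;

// Hypothetical output path; "results" is a PCollection<String>
// produced earlier in the pipeline.
PDone done = results.apply(
    TextIO.Write.named("WriteResults")
        .to("gs://my_bucket/path/to/output"));
// PDone carries no data; the return value is typically ignored.
```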
Writing To Multiple Output Files
For file-based input and output data, such as that handled by TextIO, Write transforms write to multiple output files by default. The Cloud Dataflow service always produces sharded output files automatically. When you pass an output file name to a Write transform, that file name is used as the prefix for all output files the Write transform produces. You can append a suffix to each output file by specifying a suffix to your Write transform.
Consider the following use of the Write transform, which writes multiple output files to a location in Cloud Storage. Each file has the prefix "numbers", a numeric tag, and the suffix ".csv":

    records.apply(TextIO.Write.named("WriteToText")
        .to("gs://my_bucket/path/to/numbers")
        .withSuffix(".csv"));
This Write will write multiple output files to the given location in Cloud Storage, each with the prefix "numbers" and the suffix ".csv".
I/O APIs Included in the Dataflow SDKs
Several Source and Sink APIs, such as TextIO, are included in the Dataflow SDKs.
Additional I/O APIs
In addition to the built-in I/O APIs, the Dataflow SDKs provide an extensible API that you can use to create your own custom data sources and sinks. You can create custom input sources and output sinks using Dataflow's Source and Sink APIs.