When you create a pipeline, you'll often need to read data from some external source, such as a file in Google Cloud Storage or a BigQuery table. Likewise, you'll want your pipeline to output its result data to a similar external data sink, such as output files in Cloud Storage or BigQuery. The Dataflow SDKs provide transforms that can read data from an external source or write data to an external sink.
The Dataflow SDKs provide Read and Write transforms for a number of common data storage types. In addition, the Read and Write APIs are extensible: if you want your pipeline to read from or write to a data storage format that isn't supported by the built-in transforms, you can build extensions to provide your own Read and Write operations.
Reading Input Data
Read transforms read data from an external source and return a PCollection representation of the data for use by your pipeline. You can use a Read transform at any point while constructing your pipeline to create a new PCollection, though it is most common at the start of your program.
Java
Note: Because Read transforms do not have input PCollections, they are applied to the Pipeline directly. As usual, the call to apply returns a PCollection of the appropriate type, whose elements represent the data. See Constructing Your Pipeline for more information.
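For example, here is a minimal sketch of a Read transform applied directly to the pipeline (assuming p is the Pipeline object for your program; the Cloud Storage path is hypothetical):
// Apply a Read transform to the Pipeline to create a PCollection of strings,
// one element per line of the (hypothetical) input file.
PCollection<String> lines = p.apply(
    TextIO.Read.named("ReadMyFile")
        .from("gs://some_bucket/path/to/file.txt"));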
Reading From Multiple Locations
Many Read transforms, such as Text, support reading from multiple input files matching a glob operator you provide. Consider the following use of the Read transform, which uses a glob operator (*) to read all matching input files in the given location in Google Cloud Storage:
Java
p.apply(TextIO.Read.named("ReadFromText")
    .from("gs://my_bucket/path/to/input-*.csv"));
The above Read will read all files at the given Cloud Storage location whose names have the prefix "input-" and the suffix ".csv".
To read data from disparate sources into a single PCollection, read each source independently and then use the Flatten transform to merge the results into a single PCollection.
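For example, here is a sketch of reading from two hypothetical Cloud Storage locations independently and merging the results (assuming p is the Pipeline object for your program):
// Read each source into its own PCollection.
PCollection<String> logs = p.apply(
    TextIO.Read.named("ReadLogs").from("gs://my_bucket/logs/*.txt"));
PCollection<String> archives = p.apply(
    TextIO.Read.named("ReadArchives").from("gs://my_bucket/archive/*.txt"));

// Merge the two PCollections into one using Flatten.
PCollection<String> merged = PCollectionList.of(logs).and(archives)
    .apply(Flatten.<String>pCollections());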
Writing Output Data
Write transforms write the data in a PCollection to an external data sink. You'll most often use Write transforms at the end of your program to output your pipeline's final results, but you can use Write to output a PCollection's data at any point in your pipeline.
To use a Write transform, call the apply method on the PCollection that you want to write, and pass the appropriate Write transform as an argument.
Java
When you apply a Write transform to a PCollection, the return value is an object of type PDone. The PDone object is a trivial result object and can be safely ignored.
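For example, here is a minimal sketch (assuming records is a PCollection<String> produced earlier in your pipeline; the output path is hypothetical):
// Apply a Write transform to the PCollection; the resulting PDone is ignored.
records.apply(TextIO.Write.named("WriteMyFile")
    .to("gs://some_bucket/path/to/output"));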
Writing To Multiple Output Files
For file-based input and output data, such as Text, Write transforms write to multiple output files by default. The Cloud Dataflow service always produces sharded output files automatically. When you pass an output file name to a Write transform, that name is used as the prefix for all of the output files that the Write transform produces. You can append a suffix to each output file by specifying a suffix to your Write transform.
Consider the following use of the Write transform, which writes multiple output files to a location in Cloud Storage. Each file has the prefix "numbers", a numeric tag, and the suffix ".csv".
Java
records.apply(TextIO.Write.named("WriteToText")
    .to("gs://my_bucket/path/to/numbers")
    .withSuffix(".csv"));
The above Write will write multiple output files to the given Cloud Storage location, each with the prefix "numbers" and the suffix ".csv".
I/O APIs Included in the Dataflow SDKs
The Dataflow SDKs include Source and Sink APIs for a number of common data storage types, such as text files in Cloud Storage and BigQuery tables.
Additional I/O APIs
In addition to the built-in I/O APIs, the Dataflow SDKs provide an extensible API that you can use to create your own custom data sources and sinks.
Java
You can create your own custom input sources and output sinks using Dataflow's Source and Sink APIs.
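For example, here is a hedged sketch of reading from a custom source (MyCustomSource stands in for a hypothetical BoundedSource implementation you would write against the Source API; p is the Pipeline object for your program):
// Wrap the hypothetical custom source in a Read transform and apply it to the
// pipeline as a root transform, producing a PCollection of its elements.
PCollection<String> custom = p.apply(
    Read.from(new MyCustomSource()));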