The Dataflow SDKs have built-in Read and Write transforms for text files. These transforms assume that the text files to be read or written are newline-delimited, meaning each "record" in the file is a single line of text ending in a newline character. You can read and write both local files (meaning files on the system where your Dataflow program runs) and remote files in Google Cloud Storage.
Note: If you want your pipeline to read or write local files, you'll need to use the DirectPipelineRunner to run your pipeline locally. This is because the Google Compute Engine instances that the Dataflow service uses to run your pipeline won't be able to access files on your local machine for reading and writing.
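One way to select the local runner is to set it on the pipeline's options before creating the pipeline. The sketch below assumes the Dataflow SDK 1.x for Java package layout; the local file path is illustrative:

```java
// Sketch: configuring a pipeline to run locally with DirectPipelineRunner,
// which allows reading and writing files on the local filesystem.
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectPipelineRunner.class);  // run on the local machine
Pipeline p = Pipeline.create(options);

// Local paths now work because the pipeline runs where the files live.
PCollection<String> lines = p.apply(
    TextIO.Read.named("ReadLocalFile").from("/tmp/inputData.txt"));
```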
Reading from Text Files
To read from a text file, apply TextIO.Read to your Pipeline object. TextIO.Read reads the text file and returns a PCollection<String>. Each String in the resulting PCollection represents one line from the text file.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

PCollection<String> lines = p.apply(
    TextIO.Read.named("ReadMyFile").from("gs://some/inputData.txt"));
In the example, the code calls apply on the Pipeline object and passes TextIO.Read as the Read transform. The .named operation provides a transform name for the read operation, and the .from operation provides the file path. The return value of apply is the resulting PCollection<String>, named lines.
As with other file-based Dataflow sources, the TextIO.Read transform can read multiple input files. See Reading Input Data for more information on how to handle multiple files when reading from file-based sources.
Writing to Text Files
To output data to text files, apply TextIO.Write to the PCollection that you want to output. Keep the following things in mind when using TextIO.Write:
- You may only apply TextIO.Write to a PCollection<String>. You may need to use a simple ParDo to format your data from an intermediate PCollection to a PCollection<String> prior to writing with TextIO.Write.
- Each element in the output PCollection will represent one line in the resulting text file.
- Dataflow's file-based write operations, like TextIO.Write, write to multiple output files by default. See Writing Output Data for more information.
PCollection<String> filteredWords = ...;
filteredWords.apply(TextIO.Write.named("WriteMyFile")
    .to("gs://some/outputData"));
In the example, the code calls apply on the PCollection to output and passes TextIO.Write as the Write transform. The .named operation provides a transform name for the write operation, and the .to operation provides the path to the resulting output files.
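If your intermediate data is not already a PCollection<String>, a simple ParDo can format each element into a line of text before writing. In this sketch, the KV<String, Long> element type and the "word: count" output format are illustrative assumptions:

```java
// Sketch: formatting an intermediate PCollection<KV<String, Long>> into
// a PCollection<String> so it can be written with TextIO.Write.
PCollection<KV<String, Long>> wordCounts = ...;

PCollection<String> formatted = wordCounts.apply(
    ParDo.named("FormatForWrite").of(new DoFn<KV<String, Long>, String>() {
      @Override
      public void processElement(ProcessContext c) {
        // Emit one line of text per element.
        c.output(c.element().getKey() + ": " + c.element().getValue());
      }
    }));

formatted.apply(TextIO.Write.named("WriteMyFile").to("gs://some/outputData"));
```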
Reading from Compressed Text Files
You can use the TextIO.Read transform with compressed text files, specifically files compressed with gzip and bzip2. To read a compressed file, specify the compression type by using the method .withCompressionType.
Pipeline p = ...;
p.apply(TextIO.Read.named("ReadMyFile")
    .from("gs://some/inputData.gz")
    .withCompressionType(TextIO.CompressionType.GZIP));
TextIO does not currently support writing to compressed files.
If your file has a .gz or .bz2 extension, you don't need to explicitly specify a compression type. The default compression type, AUTO, examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed types.
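A sketch of reading a mixed set of files by relying on the default AUTO compression type; the glob path here is an illustrative assumption:

```java
// Sketch: with CompressionType.AUTO (the default), the compression of each
// file matched by the glob is inferred from its extension, so the match may
// mix .gz, .bz2, and uncompressed files in a single read.
Pipeline p = ...;
PCollection<String> lines = p.apply(
    TextIO.Read.named("ReadMixedFiles")
        .from("gs://some/inputData*")
        .withCompressionType(TextIO.CompressionType.AUTO));
```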