Text Files

The Dataflow SDKs have built-in Read and Write transforms for text files. These transforms assume that the text files to be read or written are newline-delimited, meaning each "record" in the file is a single line of text ending in a newline character.

You can read and write both local files (meaning files on the system where your Dataflow program runs) and remote files in Google Cloud Storage.

Java

Note: If you want your pipeline to read or write local files, you'll need to use the DirectPipelineRunner to run your pipeline locally. This is because the Google Compute Engine instances that the Dataflow service uses to run your pipeline won't be able to access files on your local machine for reading and writing.
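A minimal sketch of configuring a pipeline to run locally, assuming the SDK's standard `PipelineOptionsFactory` and the `DirectPipelineRunner` class:

```java
// Create default options, then select the local runner so the pipeline
// can read and write files on the local filesystem.
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectPipelineRunner.class);
Pipeline p = Pipeline.create(options);
```

With this runner set, file paths such as `/tmp/input.txt` are resolved on the machine where the program runs rather than on Compute Engine workers.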

Reading from Text Files

Java

To read from a text file, apply TextIO.Read to your Pipeline object. TextIO.Read reads the text file and returns a PCollection<String>. Each String in the resulting PCollection represents one line from the text file.

  PipelineOptions options = PipelineOptionsFactory.create();
  Pipeline p = Pipeline.create(options);

  PCollection<String> lines = p.apply(
    TextIO.Read.named("ReadMyFile").from("gs://some/inputData.txt"));

In the example, the code calls apply on the Pipeline object, and passes TextIO.Read as the Read transform. The .named operation provides a transform name for the read operation, and the .from operation provides the file path. The return value of apply is the resulting PCollection<String> named lines.

As with other file-based Dataflow sources, the TextIO.Read transform can read multiple input files. See Reading Input Data for more information on how to handle multiple files when reading from file-based sources.
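To read multiple files at once, you can pass a glob pattern to `.from`. A sketch, assuming a hypothetical bucket containing sharded input files; every file matching the pattern is read into a single `PCollection<String>`:

```java
// The wildcard matches all shards (e.g. inputData-00000.txt,
// inputData-00001.txt, ...); lines from every matched file end up
// in one PCollection.
PCollection<String> allLines = p.apply(
    TextIO.Read.named("ReadAllShards")
               .from("gs://some/inputData-*.txt"));
```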

Writing to Text Files

Java

To output data to text files, apply TextIO.Write to the PCollection that you want to output. Keep the following things in mind when using TextIO.Write:

  • You may only apply TextIO.Write to a PCollection<String>. You may need to use a simple ParDo to format your data from an intermediate PCollection to a PCollection<String> prior to writing with TextIO.Write.
  • Each element in the output PCollection will represent one line in the resulting text file.
  • Dataflow's file-based write operations, like TextIO.Write, write to multiple output files by default. See Writing Output Data for more information.
  PCollection<String> filteredWords = ...;

  filteredWords.apply(TextIO.Write.named("WriteMyFile")
                                  .to("gs://some/outputData"));

In the example, the code calls apply on the PCollection to output, and passes TextIO.Write as the Write transform. The .named operation provides a transform name for the write operation, and the .to operation provides the path to the resulting output files.
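As the first bullet above notes, TextIO.Write accepts only a `PCollection<String>`, so other element types must be formatted first. A minimal sketch of that formatting step, assuming a hypothetical `PCollection<KV<String, Long>>` of word counts:

```java
PCollection<KV<String, Long>> wordCounts = ...;

// Use a simple ParDo to render each key-value pair as one line of text.
PCollection<String> formatted = wordCounts.apply(
    ParDo.named("FormatCounts").of(new DoFn<KV<String, Long>, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().getKey() + ": " + c.element().getValue());
      }
    }));

// The formatted PCollection<String> can now be written with TextIO.Write.
formatted.apply(TextIO.Write.named("WriteCounts")
                            .to("gs://some/outputData"));
```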

Reading from Compressed Text Files

Java

You can use the TextIO.Read transform with compressed text files, specifically files compressed with gzip and bzip2. To read a compressed file, specify the compression type by calling the .withCompressionType method:

  Pipeline p = ...;
  p.apply(TextIO.Read.named("ReadMyFile")
                     .from("gs://some/inputData.gz")
                     .withCompressionType(TextIO.CompressionType.GZIP));

TextIO does not currently support writing to compressed files.

If your file has a .gz or .bz2 extension, you don't need to explicitly specify a compression type. The default compression type, AUTO, examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed types.
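Because AUTO is the default, a read over a glob of mixed file types needs no explicit compression setting. A sketch, assuming a hypothetical bucket whose files carry .gz, .bz2, or no extension:

```java
// With the default AUTO compression type, each matched file's extension
// determines how it is decompressed, so no .withCompressionType call
// is needed even for a mix of compressed and uncompressed files.
PCollection<String> mixedLines = p.apply(
    TextIO.Read.named("ReadMixedFiles")
               .from("gs://some/inputData*"));
```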
