To read data from Cloud Storage to Dataflow, use the
Apache Beam TextIO
or AvroIO
I/O connector.
Include the Google Cloud library dependency
To use the TextIO
or AvroIO
connector with Cloud Storage, include
the following dependency. This library provides a schema handler for "gs://"
filenames.
Java
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
<version>${beam.version}</version>
</dependency>
Python
apache-beam[gcp]==VERSION
Go
import _ "github.com/apache/beam/sdks/v2/go/pkg/beam/io/filesystem/gcs"
For more information, see Install the Apache Beam SDK.
Parallelism
The TextIO
and AvroIO
connectors support two levels of parallelism:
- Individual files are keyed separately, so that multiple workers can read them.
- If the files are uncompressed, the connector can read sub-ranges of each file separately, leading to a very high level of parallelism. This splitting is only possible if each line in the file is a meaningful record. For example, it's not available by default for JSON files.
Performance
The following table shows performance metrics for reading from
Cloud Storage. The workloads were run on one e2-standard2
worker,
using the Apache Beam SDK 2.49.0 for Java. They did not use Runner v2.
100 M records | 1 kB | 1 column | Throughput (bytes) | Throughput (elements) |
---|---|---|
Read | 320 MBps | 320,000 elements per second |
These metrics are based on simple batch pipelines. They are intended to compare performance between I/O connectors, and are not necessarily representative of real-world pipelines. Dataflow pipeline performance is complex, and is a function of VM type, the data being processed, the performance of external sources and sinks, and user code. Metrics are based on running the Java SDK, and aren't representative of the performance characteristics of other language SDKs. For more information, see Beam IO Performance.
Best practices
Avoid using
watchForNewFiles
with Cloud Storage. This approach scales poorly for large production pipelines, because the connector must keep a list of seen files in memory. The list can't be flushed from memory, which reduces the working memory of workers over time. Consider using Pub/Sub notifications for Cloud Storage instead. For more information, see File processing patterns.If both the filename and the file contents are useful data, use the
FileIO
class to read filenames. For example, a filename might contain metadata that is useful when processing the data in the file. For more information, see Accessing filenames. TheFileIO
documentation also shows an example of this pattern.
What's next
- Read the
TextIO
API documentation. - See the list of Google-provided templates.