Write from Dataflow to Bigtable

To write data from Dataflow to Bigtable, use the Apache Beam Bigtable I/O connector.
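
The connector writes a PCollection of KV<ByteString, Iterable<Mutation>> elements, where each element pairs a Bigtable row key with the mutations to apply to that row. The following is a minimal sketch of a batch pipeline that builds such elements and writes them with BigtableIO.write() using the Apache Beam SDK for Java. The project, instance, table, column family, and qualifier names are placeholders, and the input rows are hard-coded for illustration.

```java
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.extensions.protobuf.ByteStringCoder;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptor;

public class WriteToBigtableExample {

  // Builds a SetCell mutation for a single cell. "cf" and "col" are placeholder
  // column family and qualifier names; the family must already exist in the table.
  private static Iterable<Mutation> toMutations(String value) {
    return Arrays.asList(
        Mutation.newBuilder()
            .setSetCell(
                Mutation.SetCell.newBuilder()
                    .setFamilyName("cf")
                    .setColumnQualifier(ByteString.copyFromUtf8("col"))
                    .setValue(ByteString.copyFromUtf8(value))
                    // -1 asks Bigtable to assign a server-side timestamp.
                    .setTimestampMicros(-1))
            .build());
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // Hard-coded example input: a few row keys, each with one value to write.
    List<KV<String, String>> rows =
        Arrays.asList(KV.of("row-key-1", "value-1"), KV.of("row-key-2", "value-2"));

    pipeline
        .apply("CreateInput", Create.of(rows))
        // BigtableIO.write() expects KV<ByteString, Iterable<Mutation>>:
        // the row key plus the mutations to apply to that row.
        .apply(
            "ToMutations",
            MapElements.into(new TypeDescriptor<KV<ByteString, Iterable<Mutation>>>() {})
                .via(
                    (KV<String, String> kv) ->
                        KV.of(ByteString.copyFromUtf8(kv.getKey()), toMutations(kv.getValue()))))
        .setCoder(
            KvCoder.of(ByteStringCoder.of(), IterableCoder.of(ProtoCoder.of(Mutation.class))))
        .apply(
            "WriteToBigtable",
            BigtableIO.write()
                .withProjectId("my-project") // placeholder project ID
                .withInstanceId("my-instance") // placeholder instance ID
                .withTableId("my-table")); // placeholder table ID

    pipeline.run().waitUntilFinish();
  }
}
```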

Parallelism

Parallelism is controlled by the number of nodes in the Bigtable cluster. Each node manages one or more key ranges, although key ranges can move between nodes as part of load balancing. For more information, see Understand performance in the Bigtable documentation.

You are charged for the number of nodes in your instance's clusters. See Bigtable pricing.

Performance

The following table shows performance metrics for Bigtable I/O write operations. The workloads were run on one e2-standard-2 worker, using the Apache Beam SDK 2.48.0 for Java. They did not use Runner v2.

100M records, 1 kB, 1 column | Throughput (bytes) | Throughput (elements)
Write                        | 65 MBps            | 60,000 elements per second

These metrics are based on simple batch pipelines. They are intended to compare performance between I/O connectors, and are not necessarily representative of real-world pipelines. Dataflow pipeline performance is complex, and is a function of VM type, the data being processed, the performance of external sources and sinks, and user code. Metrics are based on running the Java SDK, and aren't representative of the performance characteristics of other language SDKs. For more information, see Beam IO Performance.

Best practices

  • In general, avoid using transactions. Transactions aren't guaranteed to be idempotent, and Dataflow might invoke them multiple times due to retries, causing unexpected values.

  • A single Dataflow worker might process data for many key ranges, leading to inefficient writes to Bigtable. Grouping data by Bigtable row key with GroupByKey can significantly improve write performance; see the sketch after this list.

  • If you write large datasets to Bigtable, consider calling withFlowControl, also shown in the sketch after this list. This setting automatically rate-limits traffic to Bigtable to ensure that the Bigtable servers have enough resources available to serve data.
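
The following sketch illustrates the last two recommendations together, assuming a hypothetical helper that receives mutations keyed by Bigtable row key: grouping with GroupByKey (which also produces the KV<ByteString, Iterable<Mutation>> shape that BigtableIO.write() expects) and enabling flow control on the write. The project, instance, and table IDs are placeholders.

```java
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class GroupedBigtableWrite {

  // Hypothetical helper: receives per-cell mutations keyed by Bigtable row key and writes
  // them, grouping by key first so writes for the same row land together.
  static void writeGrouped(PCollection<KV<ByteString, Mutation>> mutations) {
    mutations
        // Collect all mutations for the same row key into one element before writing.
        .apply("GroupByRowKey", GroupByKey.create())
        .apply(
            "WriteToBigtable",
            BigtableIO.write()
                .withProjectId("my-project") // placeholder project ID
                .withInstanceId("my-instance") // placeholder instance ID
                .withTableId("my-table") // placeholder table ID
                // Rate-limit traffic so Bigtable servers keep headroom for serving reads.
                .withFlowControl(true));
  }
}
```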

What's next