Read from Bigtable to Dataflow

To read data from Bigtable to Dataflow, use the Apache Beam Bigtable I/O connector.
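For example, a minimal batch pipeline that reads every row from a table with the Apache Beam SDK for Java might look like the following sketch. The project, instance, and table IDs are placeholder values, and the downstream step that extracts row keys is only illustrative.

    import com.google.bigtable.v2.Row;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BigtableReadExample {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        // Read every row from the table. Replace the IDs with your own values.
        PCollection<Row> rows = pipeline.apply("ReadFromBigtable",
            BigtableIO.read()
                .withProjectId("my-project")
                .withInstanceId("my-instance")
                .withTableId("my-table"));

        // Example downstream step: extract each row key as a string.
        rows.apply("ExtractRowKeys", MapElements
            .into(TypeDescriptors.strings())
            .via((Row row) -> row.getKey().toStringUtf8()));

        pipeline.run().waitUntilFinish();
      }
    }

To run the pipeline on Dataflow rather than locally, pass --runner=DataflowRunner along with your project, region, and staging options when you launch the job.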

Parallelism

Parallelism is controlled by the number of nodes in the Bigtable cluster. Each node manages one or more key ranges, although key ranges can move between nodes as part of load balancing. For more information, see Reads and performance in the Bigtable documentation.

You are charged for the number of nodes in your instance's clusters. See Bigtable pricing.

Performance

The following table shows performance metrics for Bigtable read operations. The workloads were run on one e2-standard-2 worker, using the Apache Beam SDK 2.48.0 for Java. They did not use Runner v2.

Workload: 100 M records, 1 kB per record, 1 column

            Throughput (bytes)   Throughput (elements)
Read        180 MBps             170,000 elements per second

These metrics are based on simple batch pipelines. They are intended to compare performance between I/O connectors, and are not necessarily representative of real-world pipelines. Dataflow pipeline performance is complex, and is a function of VM type, the data being processed, the performance of external sources and sinks, and user code. Metrics are based on running the Java SDK, and aren't representative of the performance characteristics of other language SDKs. For more information, see Beam IO Performance.

Best practices

  • For new pipelines, use the BigtableIO connector, not CloudBigtableIO.

  • Create separate app profiles for each type of pipeline. App profiles enable better metrics for differentiating traffic between pipelines, both for support and for tracking usage. The sketch after this list shows how to set an app profile and other connector options.

  • Monitor the Bigtable nodes. If you experience performance bottlenecks, check whether resources within Bigtable, such as CPU, are constrained. For more information, see Monitoring.

  • In general, the default timeouts are well tuned for most pipelines. If a streaming pipeline appears to get stuck reading from Bigtable, try calling withAttemptTimeout to adjust the attempt timeout.

  • Consider enabling Bigtable autoscaling, or resizing the Bigtable cluster so that it scales with the size of your Dataflow jobs.

  • Consider setting maxNumWorkers on the Dataflow job to limit load on the Bigtable cluster.

  • If significant processing is done on a Bigtable element before a shuffle, calls to Bigtable might time out. In that case, you can call withMaxBufferElementCount to buffer elements. This method converts the read operation from streaming to paginated, which avoids the issue.

  • If you use a single Bigtable cluster for both streaming and batch pipelines, and the performance degrades on the Bigtable side, consider setting up replication on the cluster. Then separate the batch and streaming pipelines, so that they read from different replicas. For more information, see Replication overview.
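The following sketch illustrates how several of the options mentioned in the preceding list can be set on the connector. The project, instance, table, and app profile IDs are placeholders, and the timeout and buffer values are examples to adapt, not recommendations.

    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.joda.time.Duration;

    // Placeholder IDs and example values; tune them for your workload.
    BigtableIO.Read read = BigtableIO.read()
        .withProjectId("my-project")
        .withInstanceId("my-instance")
        .withTableId("my-table")
        // Use a dedicated app profile for this pipeline.
        .withAppProfileId("dataflow-batch-profile")
        // Raise the per-attempt timeout if a streaming pipeline appears stuck on reads.
        .withAttemptTimeout(Duration.standardMinutes(5))
        // Buffer elements to switch the read from streaming to paginated.
        .withMaxBufferElementCount(1000);

To cap the load that Dataflow places on the Bigtable cluster, set maxNumWorkers when you launch the job, for example by passing --maxNumWorkers=10 on the command line.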

What's next