This page outlines best practices for optimizing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. Depending on your use case, the following suggestions might lead to better performance.
Initial solutions for pipeline backlogs
When a Pub/Sub to BigQuery pipeline experiences a growing backlog and can't keep up with incoming messages, you can take the following immediate steps:
- Increase Pub/Sub acknowledgment deadline: For the associated Pub/Sub subscription, increase the acknowledgment deadline to a value slightly longer than the maximum expected message processing time. This prevents messages from being redelivered prematurely while they are still being processed.
- Scale out workers: If the unacknowledged message count and subscription backlog are growing rapidly, the pipeline's processing capacity is likely insufficient. Increase the number of Dataflow workers to handle the message volume.
- Enable exponential backoff: Enable exponential backoff to improve how the pipeline handles retries for transient issues, making it more resilient.
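The retry behavior described in the last step can be sketched as a delay schedule. The function name and defaults below are illustrative, not part of any Dataflow or Pub/Sub API; they only show the shape of exponential backoff with optional jitter:

```python
import random

def backoff_delays(base=1.0, cap=60.0, max_retries=5, jitter=True):
    """Yield wait times (in seconds) before each retry of a transient failure.

    The delay doubles on every attempt and is capped at `cap`. With jitter
    enabled, each delay is randomized ("full jitter") so that many workers
    retrying at once do not hit the downstream service in lockstep.
    """
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.random() if jitter else delay
```

With jitter disabled, the schedule for the defaults is 1 s, 2 s, 4 s, 8 s, 16 s; production code would sleep for each yielded delay between attempts.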
Long-term code and pipeline optimizations
For sustained performance and stability, the following architectural and code changes are recommended:
- Reduce `getTable` calls to BigQuery: Excessive `getTable` method calls can lead to rate limiting and performance bottlenecks. To mitigate this:
  - Cache table existence information in worker memory to avoid repeated calls for the same table.
  - Batch `getTable` calls on a per-bundle basis instead of making one call for each individual element.
  - Refactor the pipeline code to eliminate the need to check for table existence for every message.
- Use the BigQuery Storage Write API: For streaming pipelines writing to BigQuery, migrate from standard streaming inserts to the Storage Write API. The Storage Write API offers better performance and significantly higher quotas.
- Use the legacy Dataflow runner for high-cardinality jobs: For jobs that process a very large number of unique keys (high-cardinality), the legacy Dataflow runner might offer better performance than Runner v2, unless cross-language transforms are required.
- Optimize the key space: Performance can degrade when pipelines operate on millions of active keys. Adjust the pipeline's logic to perform work on a smaller, more manageable key space.
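The per-worker caching suggestion above can be sketched as follows. `TableEnsurer` and `check_fn` are hypothetical names: `check_fn` stands in for whatever client call performs the real BigQuery round trip (`getTable`, plus a create call if the table is missing). The point is only the caching pattern, not a specific client library:

```python
class TableEnsurer:
    """Sketch of caching table existence in worker memory.

    Because `_known` lives at class level, every instance in the same worker
    process shares it, so each table is checked against BigQuery at most once
    per worker rather than once per element.
    """

    _known = set()  # shared across instances in this worker process

    def __init__(self, check_fn):
        self._check_fn = check_fn
        self.round_trips = 0  # how many real BigQuery calls were made

    def ensure(self, table):
        if table not in TableEnsurer._known:
            self._check_fn(table)            # the only place BigQuery is hit
            TableEnsurer._known.add(table)
            self.round_trips += 1
```

Calling `ensure` repeatedly with the same table name triggers exactly one round trip; only the first sighting of a new table costs an API call.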
Resource, quota, and configuration management
Proper resource allocation and configuration are critical for pipeline health:
- Proactively manage quotas: Monitor quotas and request increases for any quotas that might be reached during scaling events. For example, consider the following scaling events:
  - A high rate of calls to the `TableService.getTable` or `tabledata.insertAll` methods might exceed the maximum queries per second (QPS). For more information on limits and how to request more quota, see BigQuery quotas and limits.
  - Compute Engine quotas for in-use IP addresses and CPUs might exceed maximum limits. For more information on limits and how to request more quota, see the Compute Engine quota and limits overview.
- Optimize worker configuration: To prevent out-of-memory (OOM) errors and improve stability:
  - Use worker machine types with more memory.
  - Reduce the number of threads per worker.
  - Set a higher number of workers to distribute the workload more evenly and reduce the performance impact of frequent autoscaling events.
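One way to express the worker settings above for a Python Dataflow job is as pipeline option flags. The flag names below follow the Dataflow pipeline options for the Python SDK; the helper function itself and the example values are illustrative, not tuned recommendations for any particular workload:

```python
def worker_tuning_options(machine_type, harness_threads, min_workers, max_workers):
    """Assemble the worker-tuning flags discussed above for a Python Dataflow job."""
    return [
        f"--machine_type={machine_type}",                         # more memory per worker
        f"--number_of_worker_harness_threads={harness_threads}",  # fewer threads, less memory pressure
        f"--num_workers={min_workers}",                           # start higher to smooth autoscaling
        f"--max_num_workers={max_workers}",
    ]
```

For example, `worker_tuning_options("n2-highmem-4", 12, 10, 50)` would pass a high-memory machine type, a reduced thread count, and a higher starting worker count to the job launcher.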