Pub/Sub to BigQuery best practices

This page outlines best practices for optimizing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. Depending on your use case, the following suggestions might lead to better performance.

Initial solutions for pipeline backlogs

When a Pub/Sub to BigQuery pipeline experiences a growing backlog and can't keep up with incoming messages, you can take the following immediate steps:

  • Increase Pub/Sub acknowledgment deadline: For the associated Pub/Sub subscription, increase the acknowledgment deadline to a value slightly longer than the maximum expected message processing time. This prevents messages from being redelivered while they are still being processed (see the example command after this list).
  • Scale out workers: If the unacknowledged message count and subscription backlog are growing rapidly, the pipeline's processing capacity is likely insufficient. Increase the number of Dataflow workers to handle the message volume (see the example command after this list).
  • Enable exponential backoff: Configure retries for transient failures to use exponential backoff, so the pipeline waits progressively longer between attempts instead of retrying immediately. This makes the pipeline more resilient (a sketch follows this list).
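
For example, the following command raises a subscription's acknowledgment deadline to 600 seconds (10 minutes, the maximum Pub/Sub supports); the subscription ID is a placeholder:

```sh
gcloud pubsub subscriptions update SUBSCRIPTION_ID \
    --ack-deadline=600
```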
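
For a running Streaming Engine job, one way to scale out is to raise the job's autoscaling range in flight. The job ID, region, and worker counts below are placeholders:

```sh
gcloud dataflow jobs update-options JOB_ID \
    --region=us-central1 \
    --min-num-workers=10 \
    --max-num-workers=50
```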
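
Where your pipeline code calls external services directly, a retry helper with exponential backoff and jitter can look roughly like this minimal sketch; the class name, constants, and limits are illustrative, not part of any SDK:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative retry helper with exponential backoff; not part of any SDK. */
public final class BackoffExample {
  private static final int MAX_ATTEMPTS = 5;
  private static final long INITIAL_BACKOFF_MS = 500;
  private static final long MAX_BACKOFF_MS = 32_000;

  /** Runs the action, sleeping with exponential backoff between failed attempts. */
  public static void runWithBackoff(Runnable action) throws InterruptedException {
    long backoffMs = INITIAL_BACKOFF_MS;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        action.run();
        return; // success, no retry needed
      } catch (RuntimeException e) { // treated as transient; rethrow non-transient errors in real code
        if (attempt == MAX_ATTEMPTS) {
          throw e; // retries exhausted, surface the error
        }
        // Sleep for the current backoff plus up to 50% random jitter, then
        // double the backoff for the next attempt, capped at MAX_BACKOFF_MS.
        long jitter = ThreadLocalRandom.current().nextLong(backoffMs / 2 + 1);
        Thread.sleep(backoffMs + jitter);
        backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
      }
    }
  }
}
```

Doubling the sleep with a cap, plus jitter, spreads retries out so that transient failures don't turn into synchronized retry storms across workers.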

Long-term code and pipeline optimizations

For sustained performance and stability, the following architectural and code changes are recommended:

  • Reduce getTable calls to BigQuery: Excessive getTable method calls can lead to rate limiting and performance bottlenecks. To mitigate this (a caching sketch follows this list):
    • Cache table existence information in worker memory to avoid repeated calls for the same table.
    • Batch getTable calls on a per-bundle basis instead of for each individual element.
    • Refactor the pipeline code to eliminate the need to check for table existence for every message.
  • Use the BigQuery Storage Write API: For streaming pipelines writing to BigQuery, migrate from standard streaming inserts to the Storage Write API, which offers better performance and significantly higher quotas (see the example after this list).
  • Use the legacy Dataflow runner for high-cardinality jobs: For jobs that process a very large number of unique keys (high cardinality), the legacy Dataflow runner might offer better performance than Runner v2, unless cross-language transforms are required (an example launch flag follows this list).
  • Optimize the key space: Performance can degrade when pipelines operate on millions of active keys. Adjust the pipeline's logic to work on a smaller, more manageable key space, for example by hashing keys into a fixed number of buckets (a sketch follows this list).
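
A minimal sketch of the caching approach, assuming a Beam Java pipeline: the set of verified tables lives in worker memory, so each table is checked at most once per worker rather than once per element. The ensureTableExists() method is a placeholder for your getTable/create logic.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Sketch: checks table existence at most once per worker, not per element. */
public class CachedTableCheckFn extends DoFn<KV<String, String>, KV<String, String>> {
  // Static, so all instances of this DoFn on a worker share the cache.
  private static final Set<String> knownTables = ConcurrentHashMap.newKeySet();

  @ProcessElement
  public void processElement(
      @Element KV<String, String> element, OutputReceiver<KV<String, String>> out) {
    String tableSpec = element.getKey();
    // add() returns true only the first time this worker sees the table,
    // so the BigQuery API is called once per table per worker.
    if (knownTables.add(tableSpec)) {
      ensureTableExists(tableSpec); // placeholder: getTable, create if missing
    }
    out.output(element);
  }

  private void ensureTableExists(String tableSpec) {
    // Call BigQuery's tables.get here and create the table if it is absent.
    // A production version would remove tableSpec from the cache on failure.
  }
}
```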
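
A sketch of a Storage Write API write with the Beam Java SDK; the table spec is a placeholder, and the triggering frequency and stream count shown are arbitrary starting points to tune:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class StorageWriteExample {
  /** Writes a streaming PCollection of rows using the Storage Write API. */
  static void writeRows(PCollection<TableRow> rows, TableSchema schema) {
    rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // illustrative table spec
            .withSchema(schema)
            .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
            // Exactly-once streaming writes require a triggering frequency
            // and a fixed number of write streams; tune both for throughput.
            .withTriggeringFrequency(Duration.standardSeconds(5))
            .withNumStorageWriteApiStreams(3));
  }
}
```

If at-least-once semantics are acceptable, Beam also offers Method.STORAGE_API_AT_LEAST_ONCE, which avoids the triggering-frequency requirement.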
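
At launch time, Runner v2 can be disabled with the disable_runner_v2 experiment so that the job runs on the legacy runner. The jar name here is illustrative:

```sh
java -jar my-pipeline.jar \
    --runner=DataflowRunner \
    --experiments=disable_runner_v2
```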
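
When downstream logic can tolerate unrelated keys sharing a bucket, hashing keys into a fixed number of buckets is one way to shrink the key space. A minimal Beam Java sketch, with an illustrative bucket count:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class KeyBucketingExample {
  /** Hashes arbitrary string keys into a bounded set of integer buckets. */
  static PCollection<KV<Integer, String>> bucketKeys(
      PCollection<KV<String, String>> input) {
    final int numBuckets = 1024; // illustrative; tune to your workload
    return input.apply("BucketKeys",
        MapElements.into(TypeDescriptors.kvs(
                TypeDescriptors.integers(), TypeDescriptors.strings()))
            .via(kv -> KV.of(
                Math.floorMod(kv.getKey().hashCode(), numBuckets),
                kv.getValue())));
  }
}
```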

Resource, quota, and configuration management

Proper resource allocation and configuration are critical for pipeline health:

  • Proactively manage quotas: Monitor quotas and request increases for any that might be reached as the pipeline scales. For example (a command for checking regional quotas follows this list):
    • A high rate of calls to the TableService.getTable or tabledata.insertAll methods might exceed the maximum queries per second (QPS). For more information on limits and how to request more quota, see BigQuery quotas and limits.
    • The number of in-use IP addresses and CPUs might reach the Compute Engine quota limits. For more information on limits and how to request more quota, see the Compute Engine quota and limits overview.
  • Optimize worker configuration: To prevent out-of-memory (OOM) errors and improve stability (example options follow this list):
    • Use a worker machine type with a higher memory-to-vCPU ratio, such as a highmem machine type.
    • Reduce the number of worker harness threads to limit how many elements each worker processes in parallel.
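
To review current Compute Engine quota usage in a region, a command along these lines can help; the region is a placeholder:

```sh
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric, quotas.usage, quotas.limit)"
```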
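
For a Java pipeline, the worker configuration above corresponds to launch-time pipeline options like the following; the jar name, machine type, and thread count are illustrative values to tune:

```sh
java -jar my-pipeline.jar \
    --runner=DataflowRunner \
    --workerMachineType=n1-highmem-8 \
    --numberOfWorkerHarnessThreads=50
```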

What's next