This page outlines best practices for optimizing a Dataflow
pipeline that reads from
Pub/Sub and
writes to BigQuery.
Depending on your use case, the following suggestions might lead to better
performance.
Initial solutions for pipeline backlogs
When a Pub/Sub to BigQuery pipeline experiences a
growing backlog and can't keep up with incoming messages, you can take the
following immediate steps:
Increase Pub/Sub acknowledgment deadline: For the
associated Pub/Sub subscription, increase the acknowledgment
deadline to a value slightly longer than the
maximum expected message processing time. This prevents messages from being
redelivered prematurely while they are still being processed (see the
subscription-update sketch after this list).
Scale out workers: If the unacknowledged message count and subscription
backlog are growing rapidly, the pipeline's processing capacity is likely
insufficient. Increase the number of Dataflow
workers to handle the message
volume (see the worker-scaling sketch after this list).
Enable exponential backoff: Enable exponential backoff
to improve how the pipeline handles retries for transient issues, making it
more resilient. On a Pub/Sub subscription, exponential backoff is configured
through the subscription's retry policy (see the subscription-update sketch
after this list).
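The following is a minimal sketch of both subscription changes (a longer
acknowledgment deadline and a retry policy with exponential backoff) using the
Pub/Sub admin client for Java. The project ID, subscription ID, deadline, and
backoff bounds are placeholder assumptions; pick a deadline slightly longer
than your own maximum processing time.

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.Duration;
import com.google.protobuf.FieldMask;
import com.google.pubsub.v1.RetryPolicy;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.SubscriptionName;
import com.google.pubsub.v1.UpdateSubscriptionRequest;

public class TuneSubscription {
  public static void main(String[] args) throws Exception {
    // Placeholders: substitute your own project and subscription.
    String project = "my-project";
    String subscriptionId = "my-subscription";

    try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
      Subscription subscription =
          Subscription.newBuilder()
              .setName(SubscriptionName.of(project, subscriptionId).toString())
              // Slightly longer than the maximum expected processing time
              // (600 seconds is the Pub/Sub maximum).
              .setAckDeadlineSeconds(300)
              // Exponential backoff between redelivery attempts; bounds are illustrative.
              .setRetryPolicy(
                  RetryPolicy.newBuilder()
                      .setMinimumBackoff(Duration.newBuilder().setSeconds(10))
                      .setMaximumBackoff(Duration.newBuilder().setSeconds(600)))
              .build();
      // Update only the two fields named in the mask.
      FieldMask updateMask =
          FieldMask.newBuilder()
              .addPaths("ack_deadline_seconds")
              .addPaths("retry_policy")
              .build();
      client.updateSubscription(
          UpdateSubscriptionRequest.newBuilder()
              .setSubscription(subscription)
              .setUpdateMask(updateMask)
              .build());
    }
  }
}
```

The same changes can also be made in the Google Cloud console or with the
gcloud CLI.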
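For the worker-scaling step, here is a minimal sketch of raising the initial
and maximum worker counts through Beam's Dataflow pipeline options; the counts
are illustrative and the rest of the pipeline is omitted.

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ScaleWorkers {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setStreaming(true);
    // Illustrative values: raise the floor and the autoscaling ceiling so the
    // service has room to absorb the backlog.
    options.setNumWorkers(10);
    options.setMaxNumWorkers(50);

    Pipeline pipeline = Pipeline.create(options);
    // ... add the Pub/Sub read, transforms, and BigQuery write here ...
    pipeline.run();
  }
}
```

The same values can be supplied on the command line with the --numWorkers and
--maxNumWorkers pipeline options.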
Long-term code and pipeline optimizations
For sustained performance and stability, the following architectural and code
changes are recommended:
Reduce getTable calls to BigQuery: Excessive getTable
method calls can lead to rate-limiting and performance bottlenecks. To
mitigate this:
Cache table existence information in worker memory to avoid repeated
calls for the same table (see the caching sketch after this list).
Batch getTable calls on a per-bundle basis instead of for each
individual element.
Refactor the pipeline code to eliminate the need to check for table
existence for every message.
Use the BigQuery Storage Write API: For streaming pipelines writing to
BigQuery, migrate from standard streaming inserts to the
Storage Write API. The
Storage Write API offers better performance and significantly
higher quotas (see the BigQueryIO sketch after this list).
Use the legacy Dataflow runner for high-cardinality jobs:
For jobs that process a very large number of unique keys
(high-cardinality), the legacy Dataflow runner might offer
better performance than Runner v2, unless cross-language transforms are
required.
Optimize the key space: Performance can degrade when pipelines operate
on millions of active keys. Adjust the pipeline's logic to perform work on a
smaller, more manageable key space (see the re-keying sketch after this list).
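A minimal caching sketch for the getTable reduction, assuming each element is
keyed by its destination table name; the EnsureTableExistsFn class, dataset
name, and placeholder schema are illustrative assumptions, not part of any
existing pipeline.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Checks (and optionally creates) each destination table at most once per worker JVM. */
public class EnsureTableExistsFn extends DoFn<KV<String, TableRow>, KV<String, TableRow>> {
  // Table names this worker has already verified; shared across bundles and threads.
  private static final Set<String> KNOWN_TABLES = ConcurrentHashMap.newKeySet();
  // Placeholder dataset and schema; a real pipeline would derive these from configuration.
  private static final String DATASET = "my_dataset";
  private static final Schema SCHEMA = Schema.of(Field.of("message", StandardSQLTypeName.STRING));

  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(
      @Element KV<String, TableRow> element, OutputReceiver<KV<String, TableRow>> out) {
    String tableName = element.getKey();
    // Only the first element seen for a given table triggers a getTable call on this worker.
    if (KNOWN_TABLES.add(tableName)) {
      TableId tableId = TableId.of(DATASET, tableName);
      if (bigquery.getTable(tableId) == null) {
        bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(SCHEMA)));
      }
    }
    out.output(element);
  }
}
```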
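A sketch of the corresponding BigQueryIO change for the Storage Write API
migration; the destination table, schema, stream count, and triggering
frequency are placeholders, and rows is assumed to be a PCollection of
TableRow values produced earlier in the pipeline.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WriteWithStorageApi {
  static void write(PCollection<TableRow> rows, TableSchema schema) {
    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // placeholder destination
            .withSchema(schema)
            // Storage Write API instead of streaming inserts.
            .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
            // In streaming exactly-once mode, the Storage Write API also needs a
            // triggering frequency and stream count; these values are illustrative.
            .withTriggeringFrequency(Duration.standardSeconds(5))
            .withNumStorageWriteApiStreams(3)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```

For pipelines that can tolerate occasional duplicates,
BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE is an alternative that does
not require these stream and triggering settings.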
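A re-keying sketch for the key-space optimization, assuming elements are
currently keyed by a high-cardinality string such as a user ID and carry
string payloads; the bucket count and the rekeyIntoBuckets helper are
illustrative assumptions.

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class Rekey {
  // Illustrative bucket count; choose a value that keeps per-bucket work balanced.
  private static final int NUM_BUCKETS = 1024;

  /** Re-keys from a high-cardinality string key (for example, a user ID) to a bounded bucket ID. */
  static PCollection<KV<Integer, String>> rekeyIntoBuckets(PCollection<KV<String, String>> byUserId) {
    return byUserId.apply(
        "RekeyIntoBuckets",
        MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
            .via((KV<String, String> kv) ->
                KV.of(Math.floorMod(kv.getKey().hashCode(), NUM_BUCKETS), kv.getValue())));
  }
}
```

Bucketing bounds the number of active keys and therefore the per-key state and
timers the pipeline keeps, but any downstream per-key logic must tolerate
multiple original keys sharing a bucket.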
Resource, quota, and configuration management
Proper resource allocation and configuration are critical for pipeline health:
Proactively manage quotas: Monitor quotas and request increases for any
quotas that might be reached during scaling events. For example:
A high rate of calls to the TableService.getTable or
tabledata.insertAll methods might exceed the maximum queries per
second (QPS). For more information on limits and how to request more
quota, see BigQuery quotas and
limits.
The number of in-use IP addresses and CPUs might exceed the Compute Engine
quota limits. For more information on limits and how to request more quota,
see the Compute Engine quota and limits overview.
Optimize worker configuration: To prevent out-of-memory (OOM) errors and
improve stability:
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-26 UTC."],[],[],null,["This page outlines best practices for optimizing a Dataflow\npipeline that [reads from\nPub/Sub](/dataflow/docs/concepts/streaming-with-cloud-pubsub) and\n[writes to BigQuery](/dataflow/docs/guides/write-to-bigquery).\nDepending on your use case, the following suggestions might lead to better\nperformance.\n\nInitial solutions for pipeline backlogs\n\nWhen a Pub/Sub to BigQuery pipeline experiences a\ngrowing backlog and can't keep up with incoming messages, you can take the\nfollowing immediate steps:\n\n- **Increase Pub/Sub acknowledgment deadline:** For the associated Pub/Sub subscription, [increase the acknowledgment\n deadline](/pubsub/docs/lease-management) to a value slightly longer than the maximum expected message processing time. This prevents messages from being redelivered prematurely while they are still being processed.\n- **Scale out workers:** If the unacknowledged message count and subscription backlog are growing rapidly, the pipeline's processing capacity is likely insufficient. Increase the [number of Dataflow\n workers](/dataflow/docs/reference/pipeline-options) to handle the message volume.\n- **Enable exponential backoff:** Enable [exponential backoff](/pubsub/docs/handling-failures#exponential_backoff) to improve how the pipeline handles retries for transient issues, making it more resilient.\n\nLong-term code and pipeline optimizations\n\nFor sustained performance and stability, the following architectural and code\nchanges are recommended:\n\n- **Reduce `getTable` calls to BigQuery:** Excessive `getTable` method calls can lead to rate-limiting and performance bottlenecks. To mitigate this:\n - Cache table existence information in worker memory to avoid repeated calls for the same table.\n - Batch `getTable` calls on a per-bundle basis instead of for each individual element.\n - Refactor the pipeline code to eliminate the need to check for table existence for every message.\n- **Use the BigQuery Storage Write API:** For streaming pipelines writing to BigQuery, migrate from standard streaming inserts to the [Storage Write API](/bigquery/docs/write-api). The Storage Write API offers better performance and significantly higher quotas.\n- **Use the legacy Dataflow runner for high-cardinality jobs:** For jobs that process a very large number of unique keys (*high-cardinality*), the legacy Dataflow runner might offer better performance than Runner v2, unless cross-language transforms are required.\n- **Optimize the key space:** Performance can degrade when pipelines operate on millions of active keys. Adjust the pipeline's logic to perform work on a smaller, more manageable key space.\n\nResource, quota, and configuration management\n\nProper resource allocation and configuration are critical for pipeline health:\n\n- **Proactively manage quotas:** Monitor quotas and request increases for any quotas that might be reached during scaling events. 
What's next
Develop and test Dataflow pipelines
Dataflow pipeline best practices
Dataflow job metrics