The Dataflow managed service has the following quota limits:
- Each Google Cloud project can make up to 3,000,000 requests per minute.
- Each Dataflow job can use a maximum of 1,000 Compute Engine instances.
- Each Google Cloud project can run at most 25 concurrent Dataflow jobs by default.
- Each Dataflow worker has a limit on the number of log messages it can output in a given time interval. See the logging documentation for the exact limit.
- If you opt in to organization-level quotas, each organization can run at most 125 concurrent Dataflow jobs by default.
- Each user can make up to 15,000 monitoring requests per minute.
- Each user can make up to 60 job creation requests per minute.
- Each user can make up to 60 job template requests per minute.
- Each user can make up to 60 job update requests per minute.
- Each Google Cloud project gets the following shuffle slots in each region:
- asia-east1: 48 slots
- asia-northeast1: 24 slots
- asia-south1: 64 slots
- asia-southeast1: 32 slots
- europe-west1: 640 slots
- europe-west2: 32 slots
- europe-west3: 40 slots
- europe-west4: 512 slots
- northamerica-northeast1: 384 slots
- us-central1: 640 slots
- us-east1: 640 slots
- us-east4: 32 slots
- us-west1: 384 slots
- us-west2: 24 slots
- us-west3: 24 slots
- others: 16 slots
- Dataflow batch jobs will be cancelled after 30 days.
You can check your current usage of Dataflow-specific quota:
- In the Google Cloud console, go to the APIs & Services page.
- To check your current Shuffle slots quota usage, on the Quotas tab, find the Shuffle slots line in the table, and in the Usage Chart column, click Show usage chart.
The Dataflow service uses various components of Google Cloud, such as BigQuery, Cloud Storage, Pub/Sub, and Compute Engine. These services (and other Google Cloud services) employ quotas to cap the maximum number of resources you can use within a project. When you use Dataflow, you might need to adjust your quota settings for these services.
Compute Engine quotas
When you run your pipeline on the Dataflow service, Dataflow creates Compute Engine instances to run your pipeline code.
- CPUs: The default machine types for Dataflow are n1-standard-2 for jobs that use Streaming Engine and n1-standard-4 for jobs that do not use Streaming Engine. FlexRS uses n1-standard-2 machines by default. During the beta release, FlexRS uses 90% preemptible VMs and 10% regular VMs. Compute Engine calculates the number of CPUs by summing each instance’s total CPU count. For example, running 10 n1-standard-4 instances counts as 40 CPUs. See Compute Engine machine types for a mapping of machine types to CPU count. (A worker configuration sketch follows this list.)
- In-Use IP Addresses: The number of in-use IP addresses in your project must be sufficient to accommodate the desired number of instances. To use 10 Compute Engine instances, you'll need 10 in-use IP addresses.
- Persistent Disk: Dataflow attaches Persistent Disk to each instance.
- The default disk size is 250 GB for batch and 400 GB for streaming pipelines. For 10 instances, by default you need 2,500 GB of Persistent Disk for a batch job.
- The default disk size is 25 GB for Dataflow Shuffle batch pipelines.
- The default disk size is 30 GB for Streaming Engine streaming pipelines.
- The Dataflow service is currently limited to 15 persistent disks per worker instance when running a streaming job. Each persistent disk is local to an individual Compute Engine virtual machine. A 1:1 ratio between workers and disks is the minimum resource allotment.
- Compute Engine usage is based on the average number of workers, whereas Persistent Disk usage is based on the exact value of --maxNumWorkers. Persistent Disks are redistributed such that each worker has an equal number of attached disks.
- Regional Managed Instance Groups: Dataflow deploys your Compute Engine instances as a Regional Managed Instance Group. You'll need to ensure you have the following related quota available:
- One Instance Group per Dataflow job
- One Instance Template per Dataflow job
- One Regional Managed Instance Group per Dataflow job
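The following is a minimal Apache Beam Java sketch of how worker settings map to the Compute Engine quotas above, assuming the Dataflow runner's DataflowPipelineOptions interface; the machine type, worker cap, and disk size shown are illustrative values, not recommendations.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class QuotaAwareWorkerConfig {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // CPU quota: 40 workers x n1-standard-4 (4 vCPUs) = 160 CPUs in the
    // worker region, plus 40 in-use IP addresses if workers use public IPs.
    options.setWorkerMachineType("n1-standard-4");
    options.setMaxNumWorkers(40);

    // Persistent Disk quota is sized from --maxNumWorkers, not the average
    // number of workers: 40 workers x 250 GB (batch default) = 10,000 GB.
    options.setDiskSizeGb(250);

    Pipeline pipeline = Pipeline.create(options);
    // ... add transforms here ...
    pipeline.run();
  }
}
```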
Depending on which sources and sinks you are using, you might also need additional quota.
- Pub/Sub: If you are using Pub/Sub, you might need additional quota. When planning for quota, note that processing 1 message from Pub/Sub involves 3 operations. If you use custom timestamps, double your expected number of operations, since Dataflow creates a separate subscription to track custom timestamps (see the sketch after this list).
- BigQuery: If you are using the streaming API for BigQuery, quota limits and other restrictions apply.
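To illustrate the Pub/Sub point above, here is a minimal Apache Beam Java sketch; the project, subscription, and timestamp-attribute names are hypothetical, and the operation counts in the comments simply restate the planning arithmetic from the list.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class PubsubQuotaExample {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // With a custom timestamp attribute, Dataflow creates a separate tracking
    // subscription, so plan for roughly double the Pub/Sub operations:
    // e.g. 1,000 messages/s x 3 operations x 2 = ~6,000 operations/s.
    PCollection<PubsubMessage> messages =
        pipeline.apply(
            PubsubIO.readMessages()
                .fromSubscription("projects/my-project/subscriptions/my-subscription")
                .withTimestampAttribute("event_ts")); // hypothetical attribute name

    pipeline.run();
  }
}
```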
Quotas and limits are the same for Dataflow and Dataflow Prime. If you have quotas for Dataflow, then you don't need additional quota to run your jobs using Dataflow Prime.
This section describes practical production limits for Dataflow.
| Limit | Amount |
|---|---|
| Maximum number of workers per pipeline. | 1,000 |
| Maximum size for a job creation request. Pipeline descriptions with a lot of steps and very verbose names may reach this limit. | 10 MB |
| Maximum number of side input shards. | 20,000 |
| Maximum size for a single element (except where stricter conditions apply, for example Streaming Engine). | 2 GB |
| Maximum size for a single element value in Streaming Engine. | 80 MB |
| Maximum number of log entries in a given time period, per worker. | 15,000 messages every 30 seconds |
| Maximum number of custom metrics per Dataflow job. | 100 |
| Length of time that recommendations will be stored. | 30 days |
| Streaming Engine Limits | Amount |
|---|---|
| Maximum bytes for Pub/Sub messages. | 7 MB |
| Maximum size of a large key. Keys over 64 KB cause decreased performance. | 2 MB |
| Maximum size for a side input. | 80 MB |
| Maximum length for state tags used by | |