This page describes pricing for Dataflow. To see the pricing for other products, read the Pricing documentation.
While pricing rates are stated per hour, Dataflow service usage is billed in per-second increments on a per-job basis. Usage is expressed in hours (for example, 30 minutes is 0.5 hours) so that the hourly rate can be applied to second-by-second use. Workers and jobs consume resources as described in the following sections.
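As a sketch of how per-second metering maps onto an hourly rate, consider the following (the rate value is a made-up placeholder, not an actual Dataflow price):

```python
# Sketch: converting per-second usage into billable hours at an hourly rate.
# The hourly rate used below is a hypothetical placeholder, not a real price.

def billable_amount(seconds_used: int, hourly_rate: float) -> float:
    """Usage is metered per second but stated in hours for billing."""
    hours = seconds_used / 3600.0  # e.g. 1800 s (30 minutes) -> 0.5 hours
    return hours * hourly_rate

# A job that ran for 30 minutes at a hypothetical $0.06/hour rate:
print(billable_amount(1800, 0.06))
```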
Workers and worker resources
Each Dataflow job uses at least one Dataflow worker. The Dataflow service provides two worker types: batch and streaming. There are separate service charges for batch and streaming workers.
Dataflow workers consume the following resources, each billed on a per-second basis:
- vCPU
- Memory
- Storage: Persistent Disk
- GPU (optional)
Batch and streaming workers are specialized resources that use Compute Engine. However, a Dataflow job does not emit Compute Engine billing for Compute Engine resources managed by the Dataflow service. Instead, Dataflow service charges encompass the use of these Compute Engine resources.
You can override the default worker count for a job. If you are using autoscaling, you can specify the maximum number of workers to allocate to a job. Workers and their respective resources are added and removed automatically as autoscaling actuates.
In addition, you can use pipeline options to override the default resource settings (machine type, disk type, and disk size) that are allocated to each worker and use GPUs.
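As an illustration, an Apache Beam Python pipeline launched on Dataflow can override worker defaults with flags along these lines (the script name, project, region, and values are placeholders; verify flag spellings against the Beam pipeline-options reference for your SDK and version):

```shell
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --machine_type=n1-standard-2 \
  --disk_size_gb=50 \
  --max_num_workers=10
```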
The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. By default, Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and Persistent Disk storage.
Dataflow also provides an optional, highly scalable feature, Dataflow Shuffle, which is available only for batch pipelines and shuffles data outside of workers. Shuffle usage is charged by the volume of data processed. You can instruct Dataflow to use Shuffle by specifying the Shuffle pipeline parameter.
Similar to Shuffle, the Dataflow Streaming Engine moves streaming shuffle and state processing out of the worker VMs and into the Dataflow service backend. You instruct Dataflow to use the Streaming Engine for your streaming pipelines by specifying the Streaming Engine pipeline parameter. Streaming Engine usage is billed by the volume of streaming data processed, which depends on the volume of data ingested into your streaming pipeline and the complexity and number of pipeline stages. Examples of what counts as a byte processed include input flows from data sources, flows of data from one fused pipeline stage to another fused stage, flows of data persisted in user-defined state or used for windowing, and output messages to data sinks, such as to Pub/Sub or BigQuery.
Dataflow also provides an option with discounted CPU and memory pricing for batch processing. Flexible Resource Scheduling (FlexRS) combines regular and preemptible VMs in a single Dataflow worker pool, giving users access to cheaper processing resources. FlexRS also delays the execution of a batch Dataflow job within a 6-hour window to identify the best point in time to start the job based on available resources. While Dataflow uses a combination of workers to execute a FlexRS job, you are billed a uniform discounted rate compared to regular Dataflow prices, regardless of the worker type. You instruct Dataflow to use FlexRS for your autoscaled batch pipelines by specifying the FlexRS parameter.
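For reference, the Beam Python SDK exposes pipeline flags for the features above; the invocations below are a sketch (flag spellings vary by SDK and version, so check the Beam documentation before relying on them):

```shell
# Batch pipeline using the Dataflow Shuffle service:
python my_batch_pipeline.py --runner=DataflowRunner \
  --experiments=shuffle_mode=service

# Streaming pipeline using Streaming Engine:
python my_streaming_pipeline.py --runner=DataflowRunner \
  --enable_streaming_engine

# Autoscaled batch pipeline using FlexRS:
python my_batch_pipeline.py --runner=DataflowRunner \
  --flexrs_goal=COST_OPTIMIZED
```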
To help you manage the reliability of your streaming pipelines, Dataflow snapshots allow you to save and restore your pipeline state. Snapshot usage is billed by the volume of data stored, which depends on the volume of data ingested into your streaming pipeline, your windowing logic, and the number of pipeline stages. You can take a snapshot of your streaming job using the Dataflow Web UI or the gcloud command-line tool. There is no additional charge for creating a job from your snapshot to restore your pipeline's state. For more information, see Using Dataflow snapshots.
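From the command line, taking a snapshot looks roughly like the following (the job ID and region are placeholders; see the gcloud dataflow reference for exact syntax and options):

```shell
gcloud dataflow snapshots create \
  --job-id=2024-01-01_00_00_00-1234567890 \
  --region=us-central1
```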
Dataflow Prime is a new data processing platform that builds on Dataflow and brings improvements in resource utilization and distributed diagnostics.
A job running Dataflow Prime is priced by the number of Dataflow Processing Units (DPUs) the job consumes. DPUs represent the computing resources that are allocated to run your pipeline.
What is a Dataflow Processing Unit?
A Dataflow Processing Unit (DPU) is a Dataflow usage metering unit that tracks the amount of resources consumed by your jobs. DPUs track usage of various resources, including compute, memory, disk storage, data shuffled (in the case of batch jobs), and streaming data processed (in the case of streaming jobs). Jobs that consume more resources see more DPU usage than jobs that consume fewer resources. While there is no one-to-one mapping between the various resources your job consumes and DPUs, 1 DPU is comparable to the resources used by a Dataflow job that runs for one hour on a worker with 1 vCPU, 4 GB of memory, and a 250 GB Persistent Disk.
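There is no published formula, but the stated baseline (1 DPU is comparable to one hour on a 1 vCPU, 4 GB, 250 GB Persistent Disk worker) allows a rough back-of-the-envelope comparison. The sketch below is purely illustrative and is not how Dataflow Prime actually meters DPUs; Dataflow Prime reports each job's real DPU usage:

```python
# Illustrative only: compare a job's resource-hours against the 1-DPU baseline
# described above (1 vCPU, 4 GB memory, 250 GB PD for one hour). This is NOT
# Google's metering formula.

BASELINE = {"vcpu_hours": 1.0, "gb_mem_hours": 4.0, "gb_pd_hours": 250.0}

def rough_dpu_equivalent(vcpu_hours: float, gb_mem_hours: float,
                         gb_pd_hours: float) -> float:
    """Average the job's usage relative to the baseline worker-hour."""
    ratios = [
        vcpu_hours / BASELINE["vcpu_hours"],
        gb_mem_hours / BASELINE["gb_mem_hours"],
        gb_pd_hours / BASELINE["gb_pd_hours"],
    ]
    return sum(ratios) / len(ratios)

# Two baseline-sized workers running for one hour is comparable to ~2 DPUs.
print(rough_dpu_equivalent(2, 8, 500))  # -> 2.0
```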
How do I optimize the number of Dataflow Processing Units used by my job?
You cannot set the number of DPUs for your jobs; DPUs are counted by Dataflow Prime. However, you can reduce the number of DPUs consumed by focusing on the following aspects of your job:
- Reducing memory consumption.
- Reducing the amount of data processed in shuffling steps by using filters, combiners, and efficient coders.
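To see why combiners help, here is a plain-Python sketch (not Beam code) comparing the bytes that would cross a shuffle boundary with and without pre-aggregating values per key on each worker, using serialized record size as a rough proxy for shuffle volume:

```python
import json
from collections import defaultdict

def shuffled_bytes(records) -> int:
    """Rough proxy for shuffle volume: serialized size of shuffled records."""
    return sum(len(json.dumps(r).encode()) for r in records)

# 1000 events for one key, all produced on a single worker.
events = [("user-1", 1)] * 1000

# Without a combiner, every record is shuffled individually.
raw = shuffled_bytes(events)

# With combiner-style pre-aggregation, the worker emits one partial sum per key.
partial = defaultdict(int)
for key, value in events:
    partial[key] += value
combined = shuffled_bytes(list(partial.items()))

print(raw, combined)  # pre-aggregation shuffles far fewer bytes
```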
How are Dataflow Processing Units billed?
You are billed by the second for the total number of DPUs consumed by your job during a given hour. The price of a single DPU varies based on the job type: batch or streaming.
How can I limit the number of Dataflow Processing Units that my job consumes?
The total number of DPUs your job can consume is constrained by the maximum resources available to the job. For example, setting the maximum number of workers for your job limits the number of DPUs that it can consume.
How is Dataflow Prime pricing different from Dataflow's pricing model?
In the Dataflow model, you are charged for the disparate resources your jobs consume: vCPUs, memory, storage, and the amount of data processed by Dataflow Shuffle or Streaming Engine.
Dataflow Processing Units consolidate these resources into a single metering unit. You are then billed for the number of DPUs consumed based on the job type—batch or streaming. The decoupling of DPUs from physical resources makes it easier to compare costs between jobs and track Dataflow usage over time. For more information, see Using Dataflow Prime.
What happens to my existing jobs that are using the Dataflow pricing model?
Your existing batch and streaming jobs continue to be billed under the Dataflow pricing model. When you update your jobs to use Dataflow Prime, they are billed for the DPUs they consume.
Additional job resources
In addition to worker resource usage, a job might consume other resources, each billed at its own pricing.
Future releases of Dataflow may have different service charges or may bundle related services differently.
See the Compute Engine Regions and Zones page for more information about the available regions and their zones.
Worker resources pricing
Other resources pricing
The following resources are billed at the same rate for streaming, batch, and FlexRS jobs.
This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations.
Dataflow Shuffle pricing is based on volume adjustments applied to the amount of data processed during read and write operations while shuffling your dataset. For more information, see Dataflow Shuffle pricing details.
Dataflow Shuffle pricing details
Charges are calculated per Dataflow job through volume adjustments applied to the total amount of data processed during Dataflow Shuffle operations. Your actual bill for the Dataflow Shuffle data processed is equivalent to being charged full price for a smaller amount of data than the amount processed by a Dataflow job. This difference results in the billable Dataflow Shuffle data metric being smaller than the total Dataflow Shuffle data metric.
The following table explains how these adjustments are applied:
| Data processed by a job | Billing adjustment |
| --- | --- |
| First 250 GB | 75% reduction |
| Next 4870 GB | 50% reduction |
| Remaining data over 5120 GB (5 TB) | None |
For example, if your pipeline results in 1024 GB (1 TB) of total Dataflow Shuffle data processed, the billable amount is 250 GB × 25% + 774 GB × 50% = 449.5 GB, charged at the regional Dataflow Shuffle data processing rate. If your pipeline results in 10240 GB (10 TB) of total Dataflow Shuffle data processed, the billable amount of data is 250 GB × 25% + 4870 GB × 50% + 5120 GB = 7617.5 GB.
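The volume adjustments above can be sketched as a small tiered function (illustrative only; your actual bill multiplies the billable gigabytes by the regional rate):

```python
def billable_shuffle_gb(processed_gb: float) -> float:
    """Apply the Dataflow Shuffle volume adjustments to data processed (GB)."""
    tier1 = min(processed_gb, 250) * 0.25                  # first 250 GB: 75% reduction
    tier2 = min(max(processed_gb - 250, 0), 4870) * 0.50   # next 4870 GB: 50% reduction
    tier3 = max(processed_gb - 5120, 0)                    # over 5120 GB: no reduction
    return tier1 + tier2 + tier3

print(billable_shuffle_gb(1024))   # 449.5
print(billable_shuffle_gb(10240))  # 7617.5
```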
Dataflow snapshots will become available in other regions upon General Availability.
You can view the total vCPU, memory, and Persistent Disk resources associated with a job either in the Google Cloud Console or via the gcloud command line tool. You can track both the actual and chargeable Shuffle Data Processed and Streaming Data Processed metrics in the Dataflow Monitoring Interface. You can use the actual Shuffle Data Processed to evaluate the performance of your pipeline and the chargeable Shuffle Data Processed to determine the costs of the Dataflow job. For Streaming Data Processed, the actual and chargeable metrics are identical.
Use the Google Cloud Pricing Calculator to help you understand how your bill is calculated.
- Read the Dataflow documentation.
- Get started with Dataflow.
- Try the Pricing calculator.
- Learn about Dataflow solutions and use cases.