IMPORTANT: The pricing model for Cloud Dataflow Shuffle is changing on May 3, 2018. Cloud Dataflow jobs executed prior to May 3, 2018 will be billed according to the previous Shuffle pricing model.
This page describes pricing for Cloud Dataflow. To see the pricing for other products, read the Pricing documentation.
Although prices are quoted per hour, Cloud Dataflow service usage is billed in per-second increments, on a per-job basis. Usage is stated in hours (for example, 30 minutes is 0.5 hours) so that hourly rates can be applied to per-second usage. Workers and jobs may consume resources as described in the following sections.
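The conversion from per-second usage to hourly pricing can be sketched as follows. This is a minimal illustration only: the function names and the example rate are hypothetical and are not part of any Cloud Dataflow API or price list.

```python
def billable_hours(seconds_used: int) -> float:
    """Convert per-second usage into the fractional hours used for pricing."""
    if seconds_used < 0:
        raise ValueError("seconds_used must be non-negative")
    return seconds_used / 3600


def usage_charge(seconds_used: int, hourly_rate: float) -> float:
    """Apply an hourly rate to usage measured in seconds.

    hourly_rate is a placeholder supplied by the caller, not a published price.
    """
    return billable_hours(seconds_used) * hourly_rate


# 30 minutes of usage is stated as 0.5 hours.
print(billable_hours(30 * 60))  # 0.5
```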
Workers and worker resources
Each Cloud Dataflow job uses at least one Cloud Dataflow worker. The Cloud Dataflow service provides two worker types: batch and streaming. There are separate service charges for batch and streaming workers.
Cloud Dataflow workers consume the following resources, each billed on a per-second basis:
- vCPU
- Memory
- Storage: Persistent Disk
Batch and streaming workers are specialized resources that use Compute Engine. However, a Cloud Dataflow job does not incur separate Compute Engine billing for Compute Engine resources managed by the Cloud Dataflow service. Instead, Cloud Dataflow service charges cover the use of these Compute Engine resources.
You can override the default worker count for a job. If you are using autoscaling, you can specify the maximum number of workers to be allocated to a job. Workers and their resources are then added and removed automatically as autoscaling takes effect.
In addition, you can use pipeline options to override the default resource settings (machine type, disk type, and disk size) that are allocated to each worker.
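To see how worker count, machine size, and disk size feed into worker charges, here is a rough estimator. It is a sketch under stated assumptions: the worker count is constant for the whole run (no autoscaling), `WorkerRates` and `worker_resource_cost` are hypothetical names, and every rate is a placeholder that you would replace with the published price for your worker type and region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerRates:
    """Placeholder hourly rates; substitute the published prices."""
    vcpu_per_hour: float
    memory_per_gb_hour: float
    disk_per_gb_hour: float


def worker_resource_cost(
    num_workers: int,
    vcpus_per_worker: int,
    memory_gb_per_worker: float,
    disk_gb_per_worker: float,
    runtime_seconds: int,
    rates: WorkerRates,
) -> float:
    """Estimate worker charges, assuming a fixed number of identical workers."""
    hours = runtime_seconds / 3600  # usage is billed per second, priced per hour
    vcpu_hours = num_workers * vcpus_per_worker * hours
    memory_gb_hours = num_workers * memory_gb_per_worker * hours
    disk_gb_hours = num_workers * disk_gb_per_worker * hours
    return (
        vcpu_hours * rates.vcpu_per_hour
        + memory_gb_hours * rates.memory_per_gb_hour
        + disk_gb_hours * rates.disk_per_gb_hour
    )


# Illustrative numbers only; these are not real prices.
example_rates = WorkerRates(vcpu_per_hour=1.0, memory_per_gb_hour=0.5, disk_per_gb_hour=0.25)
estimate = worker_resource_cost(2, 4, 16, 250, 1800, example_rates)
```

With autoscaling enabled, the worker count varies over the life of the job, so an estimate like this is only an upper or lower bound depending on the worker count you plug in.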
Cloud Dataflow services
The Cloud Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. By default, Cloud Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and Persistent Disk storage.
Cloud Dataflow also provides an optional, highly scalable feature, Cloud Dataflow Shuffle, which is available only for batch pipelines and shuffles data outside of workers. Shuffle is billed by the volume of data processed. You can instruct Cloud Dataflow to use Shuffle by specifying the Shuffle pipeline parameter.
Similar to Shuffle, the Cloud Dataflow Streaming Engine moves streaming shuffle and state processing out of the worker VMs and into the Cloud Dataflow service backend. You instruct Cloud Dataflow to use the Streaming Engine for your streaming pipelines by specifying the Streaming Engine pipeline parameter. Streaming Engine usage is billed by the volume of streaming data processed, which depends on the volume of data ingested into your streaming pipeline and the complexity and number of pipeline stages. Examples of what counts as a byte processed include input flows from data sources, flows of data from one fused pipeline stage to another fused stage, flows of data persisted in user-defined state or used for windowing, and output messages to data sinks, such as to Cloud Pub/Sub or BigQuery.
Additional job resources
In addition to worker resource usage, a job might consume other resources, each billed at its own pricing.
Future releases of Cloud Dataflow may have different service charges and/or bundling of related services.
See the Compute Engine Regions and Zones page for more information about the available regions and their zones.
|Cloud Dataflow Worker Type|vCPU (per hour)|Memory (per GB per hour)|Storage - Standard Persistent Disk (per GB per hour)|Storage - SSD Persistent Disk (per GB per hour)|
If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.
4 Cloud Dataflow Streaming Engine uses the Streaming Data Processed pricing unit. Streaming Engine is currently available in beta for streaming pipelines in the us-central1 (Iowa) and europe-west1 (Belgium) regions only. It will become available in other regions in the future.
5 Prior to May 3, 2018, Cloud Dataflow Shuffle was billed by the amount of data shuffled times the time it took to shuffle the data and keep it in Shuffle’s memory; the price was $0.0216 per Gigabyte per Hour. After May 3, 2018, Shuffle is priced exclusively by the amount of data that our service infrastructure reads and writes in the process of shuffling your dataset; the pricing unit is Gigabytes with the time dependency removed from billing consideration. Users with large or very large datasets should expect to see significant reductions in their total Shuffle costs.
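The difference between the two Shuffle pricing models can be sketched in a few lines. This is a simplified illustration: it assumes the shuffled volume is held for a single fixed duration under the previous model, and the function names are hypothetical. The per-gigabyte rate for the current model is passed in as a parameter because it is not stated here.

```python
LEGACY_RATE_PER_GB_HOUR = 0.0216  # price quoted above for the pre-May 3, 2018 model


def legacy_shuffle_cost(gb_shuffled: float, hours_in_shuffle: float) -> float:
    """Previous model: data volume multiplied by time, at a per-GB-per-hour rate."""
    return gb_shuffled * hours_in_shuffle * LEGACY_RATE_PER_GB_HOUR


def current_shuffle_cost(gb_processed: float, rate_per_gb: float) -> float:
    """Current model: volume read and written only; time no longer matters.

    rate_per_gb is a placeholder; use the published per-GB price.
    """
    return gb_processed * rate_per_gb


# 100 GB held in Shuffle for 2 hours under the previous model.
legacy_example = legacy_shuffle_cost(100, 2)
```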
To further encourage the adoption of service-based Shuffle, the first five terabytes of Shuffle Data Processed are charged at rates reduced by 50%. For example, if your pipeline results in 1 TB of actual Shuffle Data Processed, you are charged for only 50% of that data volume (0.5 TB). If your pipeline results in 10 TB of actual Shuffle Data Processed, you are charged for 7.5 TB, because the first 5 TB of that volume is charged at the 50% reduced rate.
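The relationship between actual and chargeable Shuffle Data Processed in the examples above can be expressed as a small function. This is a sketch that assumes the 5 TB reduced-rate allowance applies per pipeline job, as the examples suggest; the names are hypothetical.

```python
DISCOUNTED_VOLUME_TB = 5.0  # first five terabytes are charged at reduced rates
DISCOUNT_FACTOR = 0.5       # rates reduced by 50%


def chargeable_shuffle_tb(actual_tb: float) -> float:
    """Return the chargeable Shuffle Data Processed for a given actual volume."""
    if actual_tb < 0:
        raise ValueError("actual_tb must be non-negative")
    discounted_part = min(actual_tb, DISCOUNTED_VOLUME_TB)
    full_price_part = max(actual_tb - DISCOUNTED_VOLUME_TB, 0.0)
    return discounted_part * DISCOUNT_FACTOR + full_price_part


print(chargeable_shuffle_tb(1))   # 0.5
print(chargeable_shuffle_tb(10))  # 7.5
```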
You can view the total vCPU, memory, and Persistent Disk resources associated with a job either in the Google Cloud Platform Console or with the gcloud command-line tool. You can track both the actual and chargeable Shuffle Data Processed and Streaming Data Processed metrics in the Cloud Dataflow Monitoring Interface. Use the actual Shuffle Data Processed to evaluate the performance of your pipeline and the chargeable Shuffle Data Processed to determine the cost of the Cloud Dataflow job. For Streaming Data Processed, the actual and chargeable metrics are identical.
Use the Google Cloud Platform Pricing Calculator to help you understand how your bill is calculated.