Dataflow pricing

This page describes pricing for Dataflow. To see the pricing for other products, read the Pricing documentation.

To learn how you can save 40% with a three-year commitment or 20% with a one-year commitment, review our committed use discounts (CUDs) page.

Overview

Dataflow usage is billed for resources that your jobs use. Depending on the pricing model that you use, resources are measured and billed differently.

Dataflow compute resources	Dataflow Prime compute resources
Worker CPU and memory (batch, streaming, and FlexRS); Dataflow Shuffle data processed (batch and FlexRS); Streaming Engine Compute Units or legacy Streaming Engine data processed (streaming only)	Data Compute Units (DCUs) (batch and streaming)

Other Dataflow resources that are billed for all jobs include Persistent Disk, GPUs, and snapshots.

Resources from other services might be used for the Dataflow job. Services that are used with Dataflow might include BigQuery, Pub/Sub, Cloud Storage, and Cloud Logging, among others.

Although pricing rates are hourly, Dataflow usage is billed in per-second increments, on a per-job basis. Usage is stated in hours so that hourly pricing can be applied to second-by-second use. For example, 30 minutes is 0.5 hours. Workers and jobs might consume resources as described in the following sections.
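As a minimal sketch of how per-second usage maps onto an hourly rate, the following Python snippet converts seconds to hours and multiplies by a rate. The function names and the rate value are hypothetical and chosen only for illustration; they are not actual Dataflow prices.

```python
def usage_hours(seconds_used: float) -> float:
    """Convert metered seconds into the hours used for hourly pricing."""
    return seconds_used / 3600


def billed_cost(seconds_used: float, hourly_rate: float, units: float = 1.0) -> float:
    """Cost of `units` of a resource (for example, vCPUs) used for `seconds_used` seconds."""
    return usage_hours(seconds_used) * units * hourly_rate


# Hypothetical rate per vCPU hour, for illustration only (not a real price).
EXAMPLE_VCPU_HOURLY_RATE = 0.06

# 30 minutes of usage is stated as 0.5 hours.
print(usage_hours(30 * 60))
# Cost of 4 vCPUs running for 30 minutes at the hypothetical rate.
print(billed_cost(30 * 60, EXAMPLE_VCPU_HOURLY_RATE, units=4))
```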

Future releases of Dataflow might have different service charges or bundling of related services.

Dataflow compute resources

Dataflow billing for compute resources includes the following components:

Worker CPU and memory
Dataflow Shuffle data processed for batch workloads
Streaming Engine Compute Units
Streaming Engine data processed

For more information about the available regions and their zones, see the Compute Engine Regions and Zones page.

Worker CPU and memory

Each Dataflow job uses at least one Dataflow worker. The Dataflow service provides two worker types: batch and streaming. Batch and streaming workers have separate service charges.

Dataflow workers consume the following resources, each billed on a per-second basis:

CPU
Memory

Batch and streaming workers are specialized resources that use Compute Engine. However, a Dataflow job does not emit Compute Engine billing for Compute Engine resources managed by the Dataflow service. Instead, Dataflow service charges encompass the use of these Compute Engine resources.

You can override the default worker count for a job. If you are using autoscaling, you can specify the maximum number of workers to allocate to a job. Workers and respective resources are added and removed automatically based on autoscaling actuation.

In addition, you can use pipeline options to override the default resource settings that are allocated to each worker, such as machine type, disk type, and disk size, including for workers that use GPUs.
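The overrides described above are typically passed as pipeline options. The following sketch builds a list of command-line flags in Python. The helper function is hypothetical, and the flag names (`--num_workers`, `--max_num_workers`, `--machine_type`, `--disk_size_gb`) are assumed to match the Apache Beam Python SDK; verify them against the SDK version that you use. The machine type and sizes are example values only.

```python
from typing import List, Optional


def worker_option_flags(
    max_num_workers: int,
    num_workers: Optional[int] = None,
    machine_type: Optional[str] = None,
    disk_size_gb: Optional[int] = None,
) -> List[str]:
    """Build pipeline option flags that override worker defaults (hypothetical helper)."""
    if num_workers is not None and num_workers > max_num_workers:
        raise ValueError("num_workers must not exceed max_num_workers")
    flags = ["--runner=DataflowRunner", f"--max_num_workers={max_num_workers}"]
    if num_workers is not None:
        flags.append(f"--num_workers={num_workers}")  # initial worker count
    if machine_type is not None:
        flags.append(f"--machine_type={machine_type}")  # Compute Engine machine type
    if disk_size_gb is not None:
        flags.append(f"--disk_size_gb={disk_size_gb}")  # per-worker disk size
    return flags


# Example values only; choose settings that fit your own job.
print(worker_option_flags(10, num_workers=2, machine_type="n1-standard-2", disk_size_gb=50))
```

Because worker CPU, memory, and disk are all billed resources, each of these settings affects the cost of the job.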

FlexRS

Dataflow provides an option with discounted CPU and memory pricing for batch processing. Flexible Resource Scheduling (FlexRS) combines regular and preemptible VMs in a single Dataflow worker pool, giving users access to cheaper processing resources. FlexRS also delays the execution of a batch Dataflow job within a 6-hour window to identify the best point in time to start the job based on available resources.

Although Dataflow uses a combination of workers to execute a FlexRS job, you are billed a uniform discounted rate of about 40% on CPU and memory cost compared to regular Dataflow prices, regardless of the worker type. You instruct Dataflow to use FlexRS for your autoscaled batch pipelines by specifying the FlexRS parameter.
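To illustrate the uniform discount, the following sketch applies an approximate 40% reduction to a CPU and memory cost. The function name and the input cost are hypothetical, and because the discount is stated as "about 40%", treat the result as an estimate.

```python
# "About 40%" per the description above; treat this as approximate.
FLEXRS_DISCOUNT = 0.40


def flexrs_cpu_memory_cost(regular_cost: float, discount: float = FLEXRS_DISCOUNT) -> float:
    """Estimate the FlexRS CPU and memory cost from the regular batch cost."""
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    return regular_cost * (1 - discount)


# Hypothetical regular CPU and memory cost of 100.0 (any currency).
print(round(flexrs_cpu_memory_cost(100.0), 2))
```

The discount applies only to CPU and memory. Other resources, such as Persistent Disk and Dataflow Shuffle data processed, are not reduced by this estimate.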

Dataflow Shuffle data processed

For batch pipelines, Dataflow provides a highly scalable feature, Dataflow Shuffle, that shuffles data outside of workers. For more information, see Dataflow Shuffle.

Dataflow Shuffle charges by the volume of data processed during shuffle.

Streaming Engine pricing

For streaming pipelines, the Dataflow Streaming Engine moves streaming shuffle and state processing out of the worker VMs and into the Dataflow service backend. For more information, see Streaming Engine.

Streaming Engine Compute Units

With resource-based billing, Streaming Engine resources are measured in Streaming Engine Compute Units. Dataflow meters the Streaming Engine resources that each job uses and then bills based on the total resources that are used by that job. To enable resource-based billing for your job, see Use resource-based billing. When you use resource-based billing, existing discounts are automatically applied.

When you use Dataflow Prime with resource-based billing, you're billed based on the total resources that each job uses, but the Data Compute Unit (DCU) SKU is used instead of the Streaming Engine Compute Unit SKU.

Streaming Engine data processed (legacy)

Dataflow continues to support the legacy data-processed billing. Unless you enable resource-based billing, jobs are billed by using data-processed billing.

Streaming Engine data-processed billing measures usage by the volume of streaming data processed, which depends on the following factors:

The volume of data ingested into your streaming pipeline
The complexity of the pipeline
The number of pipeline stages with shuffle operation or with stateful DoFns

Examples of what counts as a byte processed include the following items:

Input flows from data sources
Flows of data from one fused pipeline stage to another fused stage
Flows of data persisted in user-defined state or used for windowing
Output messages to data sinks, such as to Pub/Sub or BigQuery

Dataflow compute resource pricing - batch & FlexRS

The following table contains pricing details for worker resources and Shuffle data processed for batch and FlexRS jobs.

¹ Batch worker defaults: 1 vCPU, 3.75 GB memory, 250 GB Persistent Disk if not using Dataflow Shuffle, 25 GB Persistent Disk if using Dataflow Shuffle

² FlexRS worker defaults: 2 vCPU, 7.50 GB memory, 25 GB Persistent Disk per worker, with a minimum of two workers

Dataflow compute resource pricing - streaming

The following table contains pricing details for worker resources, Streaming Engine data processed (legacy), and Streaming Engine Compute Units for streaming jobs.

If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

³ Streaming worker defaults: 4 vCPU, 15 GB memory, 400 GB Persistent Disk if not using Streaming Engine, 30 GB Persistent Disk if using Streaming Engine. The Dataflow service is currently limited to 15 persistent disks per worker instance when running a streaming job. A 1:1 ratio between workers and disks is the minimum resource allotment.

⁴ Dataflow Shuffle pricing is based on volume adjustments applied to the amount of data processed during read and write operations while shuffling your dataset. For more information, see Dataflow Shuffle pricing details. Dataflow Shuffle pricing is not applicable to Streaming Engine jobs that use resource-based billing.

⁵ Streaming Engine Compute Units: for streaming jobs that use Streaming Engine and the resource-based billing model. These jobs are not billed for data processed during shuffle.

Volume adjustments for Dataflow Shuffle data processed

Charges are calculated per Dataflow job through volume adjustments applied to the total amount of data processed during Dataflow Shuffle operations. Your actual bill for the Dataflow Shuffle data processed is equivalent to being charged full price for a smaller amount of data than the amount processed by a Dataflow job. This difference results in the billable shuffle data processed metric being smaller than the total shuffle data processed metric.

The following table explains how these adjustments are applied:

Dataflow Shuffle data processed	Billing adjustment
First 250 GB	75% reduction
Next 4870 GB	50% reduction
Remaining data over 5120 GB (5 TB)	none

For example, if your pipeline results in 1024 GB (1 TB) of total Dataflow Shuffle data processed, the billable amount is calculated as follows:

250 GB * 25% + 774 GB * 50% = 449.5 GB, which is then billed at the regional Dataflow Shuffle data processing rate

If your pipeline results in 10240 GB (10 TB) of total Dataflow Shuffle data processed, the billable amount of data is:

250 GB * 25% + 4870 GB * 50% + 5120 GB = 7617.5 GB
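The tiered adjustment can be sketched in Python as follows. The function and constant names are hypothetical. The tier sizes and reductions come from the table above, and the two printed values reproduce the worked examples.

```python
# Tiers from the table above: (tier size in GB, fraction of the data that is billed).
# A 75% reduction bills 25% of the data; a 50% reduction bills 50%.
SHUFFLE_TIERS = [(250, 0.25), (4870, 0.50)]


def billable_shuffle_gb(total_gb: float) -> float:
    """Return the billable shuffle data (GB) for a job's total shuffle data processed."""
    if total_gb < 0:
        raise ValueError("total_gb must be non-negative")
    remaining = total_gb
    billable = 0.0
    for tier_size, billed_fraction in SHUFFLE_TIERS:
        in_tier = min(remaining, tier_size)
        billable += in_tier * billed_fraction
        remaining -= in_tier
    # Data over 5120 GB (5 TB) has no adjustment.
    return billable + remaining


print(billable_shuffle_gb(1024))   # 449.5
print(billable_shuffle_gb(10240))  # 7617.5
```

To get the charge, multiply the billable amount by the regional Dataflow Shuffle data processing rate.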

Dataflow Prime compute resource pricing

Dataflow Prime is a data processing platform that builds on Dataflow to bring improvements in resource utilization and distributed diagnostics.

Compute resources used by a Dataflow Prime job are priced by the number of Data Compute Units (DCUs). DCUs represent the computing resources that are allocated to run your pipeline. Other Dataflow resources used by Dataflow Prime jobs, such as Persistent Disk, GPUs, and snapshots, are billed separately.

For more information about the available regions and their zones, see the Compute Engine Regions and Zones page.

Data Compute Unit

A Data Compute Unit (DCU) is a Dataflow usage metering unit that tracks the amount of compute resources consumed by your jobs. Resources tracked by DCUs include vCPU, memory, Dataflow Shuffle data processed (for batch jobs), and Streaming Engine data processed (for streaming jobs). Jobs that consume more resources have more DCU usage compared to jobs that consume fewer resources. One DCU is comparable to the resources used by a Dataflow job that runs for one hour on a worker with 1 vCPU and 4 GB of memory.

Data Compute Unit billing

You're billed for the total number of DCUs consumed by your job. The price of a single DCU varies based on whether you have a batch job or a streaming job. When you use Dataflow Prime with resource-based billing, you're billed based on total resources used instead of bytes processed.

Optimize Data Compute Unit usage

You can't set the number of DCUs for your jobs. DCUs are counted by Dataflow Prime. However, you can reduce the number of DCUs that your job consumes in the following ways:

Reducing memory consumption
Reducing the amount of data processed in shuffling steps by using filters, combiners, and efficient coders

To identify these optimizations, use the Dataflow monitoring interface and the execution details interface.

How is Dataflow Prime pricing different from Dataflow pricing?

In Dataflow, you are charged for the disparate resources your jobs consume, such as vCPUs, memory, Persistent Disk, and the amount of data processed by Dataflow Shuffle or Streaming Engine.

Data Compute Units consolidate all of the resources except for storage into a single metering unit. You're billed for Persistent Disk resources and for the number of DCUs consumed based on the job type, batch or streaming. For more information, see Using Dataflow Prime.
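The difference between the two models can be sketched for a batch job as follows. All names and rate values here are hypothetical placeholders, not actual prices. The sketch shows only which components are summed under each model.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical unit rates; replace with the rates for your region and job type."""
    vcpu_hour: float
    memory_gb_hour: float
    pd_gb_hour: float
    shuffle_gb: float
    dcu: float


def dataflow_cost(r: Rates, vcpu_hours: float, memory_gb_hours: float,
                  pd_gb_hours: float, billable_shuffle_gb: float) -> float:
    """Dataflow model: each resource is metered and billed separately."""
    return (vcpu_hours * r.vcpu_hour
            + memory_gb_hours * r.memory_gb_hour
            + pd_gb_hours * r.pd_gb_hour
            + billable_shuffle_gb * r.shuffle_gb)


def dataflow_prime_cost(r: Rates, dcus: float, pd_gb_hours: float) -> float:
    """Dataflow Prime model: DCUs cover compute; Persistent Disk is billed separately."""
    return dcus * r.dcu + pd_gb_hours * r.pd_gb_hour


# Placeholder rates and usage, for illustration only.
rates = Rates(vcpu_hour=2.0, memory_gb_hour=1.0, pd_gb_hour=0.5, shuffle_gb=0.25, dcu=4.0)
print(dataflow_cost(rates, vcpu_hours=10, memory_gb_hours=20, pd_gb_hours=100, billable_shuffle_gb=8))
print(dataflow_prime_cost(rates, dcus=5, pd_gb_hours=100))
```

For a streaming job, the shuffle term would be replaced by the Streaming Engine charge. GPUs and snapshots, where used, are billed separately under both models.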

What happens to my existing jobs that use the Dataflow pricing model?

Your existing batch and streaming jobs continue to be billed using the Dataflow model. When you update your jobs to use Dataflow Prime, the jobs will then use the Dataflow Prime pricing model, where they are billed for the Persistent Disk resources and for the DCUs consumed.

Other Dataflow resources

Storage, GPUs, snapshots, and other resources are billed the same way for Dataflow and Dataflow Prime.

Storage resource pricing

Storage resources are billed at the same rate for streaming, batch, and FlexRS jobs. You can use pipeline options to change the default disk size or disk type. Dataflow Prime bills the Persistent Disk separately based on the pricing in the following table.

The Dataflow service is currently limited to 15 persistent disks per worker instance when running a streaming job. Each persistent disk is local to an individual Compute Engine virtual machine. A 1:1 ratio between workers and disks is the minimum resource allotment.

Jobs using Streaming Engine use 30 GB boot disks. Jobs using Dataflow Shuffle use 25 GB boot disks. For jobs that are not using these offerings, the default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode.

Compute Engine usage is based on the average number of workers, whereas Persistent Disk usage is based on the exact value of --maxNumWorkers. Persistent Disks are redistributed such that each worker has an equal number of attached disks.
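The following sketch shows one simplified reading of this rule: vCPU usage scales with the average number of workers, and Persistent Disk usage scales with the configured maximum. The function names are hypothetical, and the example assumes one disk per configured worker and uses the streaming defaults stated on this page (4 vCPUs and a 400 GB disk without Streaming Engine).

```python
def vcpu_hours(avg_workers: float, vcpus_per_worker: int, job_hours: float) -> float:
    """vCPU usage follows the average number of workers over the job's lifetime."""
    return avg_workers * vcpus_per_worker * job_hours


def persistent_disk_gb_hours(max_num_workers: int, disk_size_gb: int, job_hours: float) -> float:
    """Persistent Disk usage follows the configured maximum number of workers."""
    return max_num_workers * disk_size_gb * job_hours


# A 2-hour streaming job that averages 4 workers with the maximum set to 10.
print(vcpu_hours(avg_workers=4, vcpus_per_worker=4, job_hours=2))                    # 32
print(persistent_disk_gb_hours(max_num_workers=10, disk_size_gb=400, job_hours=2))  # 8000
```

Under this reading, a maximum worker count that is much higher than the job needs increases Persistent Disk charges even when autoscaling keeps the average worker count low.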

GPU resource pricing

GPU resources are billed at the same rate for streaming and batch jobs. FlexRS does not currently support GPUs. For information about available regions and zones for GPUs, see GPU regions and zones availability in the Compute Engine documentation.

Snapshots

To help you manage the reliability of your streaming pipelines, you can use snapshots to save and restore your pipeline state. Snapshot usage is billed by the volume of data stored, which depends on the following factors:

The volume of data ingested into your streaming pipeline
Your windowing logic
The number of pipeline stages

You can take a snapshot of your streaming job using the Dataflow console or the Google Cloud CLI. There is no additional charge for creating a job from your snapshot to restore your pipeline's state. For more information, see Using Dataflow snapshots.

Snapshot pricing

Confidential VM

Confidential VM for Dataflow encrypts data in use on worker Compute Engine VMs. For more details, see Confidential VM overview.

Using Confidential VM for Dataflow incurs additional flat per-vCPU and per-GB costs.

Confidential VM pricing

Prices are global and do not change based on Google Cloud region.

Non-Dataflow resources

In addition to Dataflow usage, a job might consume resources from other services, each billed at its own rate. These services include, but are not limited to, the following:

Cloud Storage: Dataflow jobs use Cloud Storage to store temporary files during pipeline execution. To avoid being billed for unnecessary storage costs, turn off the soft delete feature on buckets that your Dataflow jobs use for temporary storage. For more information, see Remove a soft delete policy from a bucket.
Pub/Sub
Datastore
Bigtable
BigQuery
VPC
Cloud Logging: You can route logs to other destinations or exclude logs from ingestion. For information about optimizing log volume for your Dataflow jobs, see controlling Dataflow log volume.

View resource usage

You can view the total vCPU, memory, and Persistent Disk resources associated with a job in the Job info panel under Resource metrics. You can track the following metrics in the Dataflow monitoring interface:

Total vCPU time
Total memory usage time
Total Persistent Disk usage time
Total streaming data processed
Total shuffle data processed
Billable shuffle data processed

You can use the Total shuffle data processed metric to evaluate the performance of your pipeline and the Billable shuffle data processed metric to determine the costs of the Dataflow job.

For Dataflow Prime, you can view the total number of DCUs consumed by a job in the Job info panel under Resource metrics.

Pricing calculator

Use the Google Cloud Pricing Calculator to help you understand how your bill is calculated.

What's next

Read the Dataflow documentation.
Get started with Dataflow.
Try the Pricing calculator.
Learn about Dataflow solutions and use cases.

Request a custom quote

With Google Cloud's pay-as-you-go pricing, you only pay for the services you use. Connect with our sales team to get a custom quote for your organization.
