
Dataflow pricing

This page describes pricing for Dataflow. To see the pricing for other products, read the Pricing documentation.

Overview

Dataflow usage is billed for resources that your jobs use. Depending on whether you're using Dataflow or Dataflow Prime, resources are measured and billed differently.

Dataflow compute resources:
  • Worker CPU and memory
    (batch, streaming, and FlexRS)
  • Dataflow Shuffle data processed (batch only)
  • Streaming Engine data processed (streaming only)

Dataflow Prime compute resources:
  • Data Compute Units (DCUs)
    (batch and streaming)

Other Dataflow resources billed for both Dataflow and Dataflow Prime jobs include Persistent Disk, GPUs, and snapshots.

Resources from other services might be used for the Dataflow job. Services used in conjunction with Dataflow might include BigQuery, Pub/Sub, Cloud Storage, and Cloud Logging, among others.

Although the pricing rate is based on the hour, Dataflow usage is billed in per-second increments on a per-job basis. Usage is stated in hours in order to apply hourly pricing to second-by-second use. For example, 30 minutes is 0.5 hours. Workers and jobs might consume resources as described in the following sections.
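As a sketch, the conversion from per-second metering to hourly pricing works as follows (the hourly rate shown is hypothetical, not an actual Dataflow price):

```python
def billable_hours(seconds_used: float) -> float:
    # Usage is metered per second but priced per hour,
    # so 1800 seconds (30 minutes) counts as 0.5 hours.
    return seconds_used / 3600.0

# Hypothetical hourly rate for one worker vCPU; not an actual Dataflow price.
HOURLY_RATE_USD = 0.056

cost = billable_hours(1800) * HOURLY_RATE_USD
print(cost)  # 0.5 hours at the hypothetical rate
```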

Future releases of Dataflow might have different service charges or bundling of related services.

Dataflow compute resources

Dataflow billing for compute resources includes the following components:

  • Worker CPU and memory
  • Dataflow Shuffle data processed (batch only)
  • Streaming Engine data processed (streaming only)

For more information about the available regions and their zones, see the Compute Engine Regions and Zones page.

Worker CPU and memory

Each Dataflow job uses at least one Dataflow worker. The Dataflow service provides two worker types: batch and streaming. Batch and streaming workers have separate service charges.

Dataflow workers consume the following resources, each billed on a per-second basis:

  • CPU
  • Memory

Batch and streaming workers are specialized resources that use Compute Engine. However, a Dataflow job does not emit Compute Engine billing for Compute Engine resources managed by the Dataflow service. Instead, Dataflow service charges encompass the use of these Compute Engine resources.

You can override the default worker count for a job. If you are using autoscaling, you can specify the maximum number of workers to allocate to a job. Workers and respective resources are added and removed automatically based on autoscaling actuation.

In addition, you can use pipeline options to override the default resource settings, such as machine type, disk type, and disk size, that are allocated to each worker, and to use GPUs.
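As an illustrative sketch, a Beam Python job might override these defaults with runner flags like the following. The flag names are the Python SDK's Dataflow worker options; the values are examples only, and `my_pipeline.py` is a hypothetical pipeline script:

```python
# Assemble example Dataflow runner flags for a hypothetical Beam Python job.
flags = [
    "--runner=DataflowRunner",
    "--num_workers=2",            # override the default initial worker count
    "--max_num_workers=10",       # cap autoscaling
    "--machine_type=n1-standard-2",
    "--disk_size_gb=50",
]
command = "python my_pipeline.py " + " ".join(flags)
print(command)
```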

FlexRS

Dataflow provides an option with discounted CPU and memory pricing for batch processing. Flexible Resource Scheduling (FlexRS) combines regular and preemptible VMs in a single Dataflow worker pool, giving users access to cheaper processing resources. FlexRS also delays the execution of a batch Dataflow job within a 6-hour window to identify the best point in time to start the job based on available resources.

Although Dataflow uses a combination of workers to execute a FlexRS job, you are billed a uniform discounted rate of about 40% on CPU and memory cost compared to regular Dataflow prices, regardless of the worker type. You instruct Dataflow to use FlexRS for your autoscaled batch pipelines by specifying the FlexRS parameter.
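In the Python SDK, the FlexRS parameter is the `--flexrs_goal=COST_OPTIMIZED` pipeline option. The discount itself can be sketched as follows; the roughly 40% figure comes from this page and should be treated as approximate:

```python
FLEXRS_DISCOUNT = 0.40  # approximate uniform discount on CPU and memory

def flexrs_cpu_memory_cost(regular_cost_usd: float) -> float:
    # FlexRS bills CPU and memory at a uniform discounted rate,
    # regardless of whether a worker was a regular or preemptible VM.
    return regular_cost_usd * (1 - FLEXRS_DISCOUNT)

# A job that would cost $100 in CPU and memory at regular rates
# costs roughly $60 under FlexRS.
print(flexrs_cpu_memory_cost(100.0))
```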

Dataflow Shuffle data processed

For batch pipelines, Dataflow provides a highly scalable feature, Dataflow Shuffle, that shuffles data outside of workers. For more information, see Dataflow Shuffle.

Dataflow Shuffle usage is billed by the volume of data processed during shuffle.

Streaming Engine data processed

For streaming pipelines, the Dataflow Streaming Engine moves streaming shuffle and state processing out of the worker VMs and into the Dataflow service backend. For more information, see Streaming Engine.

Streaming Engine usage is billed by the volume of streaming data processed, which depends on the following:

  • The volume of data ingested into your streaming pipeline
  • The complexity of the pipeline
  • The number of pipeline stages with a shuffle operation or with stateful DoFns

Examples of what counts as a byte processed include the following items:

  • Input flows from data sources
  • Flows of data from one fused pipeline stage to another fused stage
  • Flows of data persisted in user-defined state or used for windowing
  • Output messages to data sinks, such as to Pub/Sub or BigQuery

Dataflow compute resource pricing

The following table contains pricing details for worker resources, Dataflow Shuffle data processed, and Streaming Engine data processed.

If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

1 Batch worker defaults: 1 vCPU, 3.75 GB memory, 250 GB Persistent Disk if not using Dataflow Shuffle, 25 GB Persistent Disk if using Dataflow Shuffle

2 FlexRS worker defaults: 2 vCPU, 7.50 GB memory, 25 GB Persistent Disk per worker, with a minimum of two workers

3 Streaming worker defaults: 4 vCPU, 15 GB memory, 400 GB Persistent Disk if not using Streaming Engine, 30 GB Persistent Disk if using Streaming Engine

4 Dataflow Shuffle pricing is based on volume adjustments applied to the amount of data processed during read and write operations while shuffling your dataset. For more information, see Dataflow Shuffle pricing details.

Volume adjustments for Dataflow Shuffle data processed

Charges are calculated per Dataflow job through volume adjustments applied to the total amount of data processed during Dataflow Shuffle operations. In effect, you are charged full price for a smaller amount of data than the amount your job actually processed. As a result, the billable shuffle data processed metric is smaller than the total shuffle data processed metric.

The following table explains how these adjustments are applied:

Dataflow Shuffle data processed Billing adjustment
First 250 GB 75% reduction
Next 4870 GB 50% reduction
Remaining data over 5120 GB (5 TB) none

For example, if your pipeline results in 1024 GB (1 TB) of total Dataflow Shuffle data processed, the billable amount is calculated as follows:

250 GB * 25% + 774 GB * 50% = 449.5 GB

The charge is 449.5 GB multiplied by the regional Dataflow Shuffle data processing rate.

If your pipeline results in 10240 GB (10 TB) of total Dataflow Shuffle data processed, the billable amount of data is:

250 GB * 25% + 4870 GB * 50% + 5120 GB = 7617.5 GB
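Both worked examples can be reproduced with a small helper that applies the tiers from the volume-adjustment table (a sketch; the tier boundaries and reductions are the documented values):

```python
def billable_shuffle_gb(total_gb: float) -> float:
    # Apply the documented volume adjustments:
    # 75% reduction on the first 250 GB (you pay for 25%),
    # 50% reduction on the next 4870 GB, no reduction beyond 5120 GB.
    tier1 = min(total_gb, 250.0)
    tier2 = min(max(total_gb - 250.0, 0.0), 4870.0)
    tier3 = max(total_gb - 5120.0, 0.0)
    return tier1 * 0.25 + tier2 * 0.50 + tier3

print(billable_shuffle_gb(1024.0))   # 449.5
print(billable_shuffle_gb(10240.0))  # 7617.5
```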

Dataflow Prime compute resource pricing

Dataflow Prime is a data processing platform that builds on Dataflow to bring improvements in resource utilization and distributed diagnostics.

Compute resources used by a Dataflow Prime job are priced by the number of Data Compute Units (DCUs). DCUs represent the computing resources that are allocated to run your pipeline. Other Dataflow resources used by Dataflow Prime jobs, such as Persistent Disk, GPUs, and snapshots, are billed separately.

For more information about the available regions and their zones, see the Compute Engine Regions and Zones page.

Data Compute Unit

A Data Compute Unit (DCU) is a Dataflow usage metering unit that tracks the compute resources consumed by your jobs. Resources tracked by DCUs include vCPU, memory, Dataflow Shuffle data processed (for batch jobs), and Streaming Engine data processed (for streaming jobs). Jobs that consume more resources have higher DCU usage than jobs that consume fewer resources. One DCU is comparable to the resources used by a Dataflow job that runs for one hour on a 1 vCPU, 4 GB worker.

Data Compute Unit billing

You are billed for the total number of DCUs consumed by your job. The price of a single DCU varies based on whether you have a batch job or a streaming job.

If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

Optimize Data Compute Unit usage

You can't set the number of DCUs for your jobs. DCUs are counted by Dataflow Prime. However, you can reduce the number of DCUs consumed by managing the following aspects of your job:

  • Reducing memory consumption
  • Reducing the amount of data processed in shuffling steps by using filters, combiners, and efficient coders

To identify these optimizations, use the Dataflow monitoring interface and the execution details interface.
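As a plain-Python sketch of the second point (standing in for a Beam pipeline, not actual Beam code): pre-aggregating values per key before the shuffle boundary, as a combiner such as `beam.CombinePerKey` does, shrinks the volume of data that shuffle billing measures.

```python
from collections import defaultdict

# Hypothetical input: many records per key.
records = [("user1", 3), ("user2", 5), ("user1", 7), ("user2", 1)] * 1000

# Without a combiner, every record crosses the shuffle boundary.
elements_shuffled_without_combiner = len(records)

# With a combiner (analogous to beam.CombinePerKey(sum)), each worker
# ships one partial aggregate per key instead of every raw record.
partial_sums = defaultdict(int)
for key, value in records:
    partial_sums[key] += value
elements_shuffled_with_combiner = len(partial_sums)

print(elements_shuffled_without_combiner, elements_shuffled_with_combiner)
```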

How is Dataflow Prime pricing different from Dataflow pricing?

In Dataflow, you are charged for the disparate resources your jobs consume, such as vCPUs, memory, Persistent Disk, and the amount of data processed by Dataflow Shuffle or Streaming Engine.

Data Compute Units consolidate all of the resources except for storage into a single metering unit. You're billed for Persistent Disk resources and for the number of DCUs consumed based on the job type, batch or streaming. For more information, see Using Dataflow Prime.

What happens to my existing jobs that use the Dataflow pricing model?

Your existing batch and streaming jobs continue to be billed using the Dataflow model. When you update your jobs to use Dataflow Prime, they use the Dataflow Prime pricing model and are billed for the Persistent Disk resources and the DCUs consumed.

Other Dataflow resources

Storage, GPUs, snapshots, and other resources are billed the same way for Dataflow and Dataflow Prime.

Storage and GPU resource pricing

Storage and GPU resources are billed at the same rate for streaming, batch, and FlexRS jobs.

You can use pipeline options to change the default disk size or disk type. Dataflow Prime bills the Persistent Disk separately based on the pricing in the following table.

If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

Snapshots

To help you manage the reliability of your streaming pipelines, you can use snapshots to save and restore your pipeline state. Snapshot usage is billed by the volume of data stored, which depends on the following factors:

  • The volume of data ingested into your streaming pipeline
  • Your windowing logic
  • The number of pipeline stages

You can take a snapshot of your streaming job using the Dataflow console or the Google Cloud CLI. There is no additional charge for creating a job from your snapshot to restore your pipeline's state. For more information, see Using Dataflow snapshots.
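For reference, the Google Cloud CLI invocation can be sketched as follows. The command is assembled as a string here for illustration; `JOB_ID` and the region are placeholders, not real values:

```python
# Sketch of a gcloud CLI call to snapshot a running streaming job.
# JOB_ID and the region are placeholders.
snapshot_cmd = " ".join([
    "gcloud", "dataflow", "snapshots", "create",
    "--job-id=JOB_ID",
    "--region=us-central1",
])
print(snapshot_cmd)
```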

Snapshot pricing

If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

Non-Dataflow resources

In addition to Dataflow usage, a job might consume resources from other services, each billed at that service's own pricing. These services include, but are not limited to, BigQuery, Pub/Sub, Cloud Storage, and Cloud Logging.

Viewing resource usage

You can view the total vCPU, memory, and Persistent Disk resources associated with a job in the Job info panel under Resource metrics. You can track the following metrics in the Dataflow Monitoring Interface:

  • Total vCPU time
  • Total memory usage time
  • Total Persistent Disk usage time
  • Total streaming data processed
  • Total shuffle data processed
  • Billable shuffle data processed

You can use the Total shuffle data processed metric to evaluate the performance of your pipeline and the Billable shuffle data processed metric to determine the costs of the Dataflow job.

For Dataflow Prime, you can view the total number of DCUs consumed by a job in the Job info panel under Resource metrics.

Pricing calculator

Use the Google Cloud Pricing Calculator to help you understand how your bill is calculated.

If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

What's next

Request a custom quote

With Google Cloud's pay-as-you-go pricing, you only pay for the services you use. Connect with our sales team to get a custom quote for your organization.