Cloud Dataflow

Simplified stream and batch data processing, with equal reliability and expressiveness


Faster development, easier management

Cloud Dataflow is a fully managed service for transforming and enriching data in stream (real-time) and batch (historical) modes with equal reliability and expressiveness, with no complex workarounds or compromises needed. With its serverless approach to resource provisioning and management, you have access to virtually limitless capacity to solve your biggest data processing challenges while paying only for what you use.

Cloud Dataflow unlocks transformational use cases across industries, including:

  • Clickstream, point-of-sale, and segmentation analysis in retail
  • Fraud detection in financial services
  • Personalized user experience in gaming
  • IoT analytics in manufacturing, healthcare, and logistics

Accelerate development for batch and streaming

Cloud Dataflow supports fast, simplified pipeline development via expressive SQL, Java, and Python APIs in the Apache Beam SDK, which provides a rich set of windowing and session analysis primitives as well as an ecosystem of source and sink connectors. Plus, Beam’s unique, unified development model lets you reuse more code across streaming and batch pipelines.
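To make the idea of session analysis concrete, here is a minimal plain-Python sketch of session windowing: a user's events are grouped into one session until a gap of inactivity longer than a chosen threshold appears. This is not the Apache Beam API. The `sessionize` function, the `(user_id, timestamp)` event shape, and the 30-second gap are all assumptions made for illustration, and the sketch assumes each user's events arrive in time order, which a real pipeline cannot rely on.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape: (user_id, event_time_in_seconds).
Event = Tuple[str, float]


def sessionize(events: Iterable[Event], gap_seconds: float) -> Iterator[Tuple[str, List[float]]]:
    """Group each user's event times into sessions split by inactivity gaps.

    Assumes events arrive ordered by time for each user.
    """
    open_sessions: Dict[str, List[float]] = {}  # user_id -> timestamps of the open session
    for user, ts in events:
        current = open_sessions.get(user)
        # A gap longer than the threshold closes the user's open session.
        if current is not None and ts - current[-1] > gap_seconds:
            yield user, current
            current = None
        if current is None:
            current = []
            open_sessions[user] = current
        current.append(ts)
    # Input is exhausted: flush whatever sessions are still open.
    for user, session in open_sessions.items():
        yield user, session


batch = [("a", 0.0), ("a", 10.0), ("b", 5.0), ("a", 100.0), ("b", 8.0)]
sessions = list(sessionize(batch, gap_seconds=30.0))
```

Because `sessionize` accepts any iterable, the same function can be fed a stored list or a generator that produces events as they arrive, which is the kind of code reuse a unified model is meant to encourage.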

To request a notification of Dataflow SQL’s upcoming alpha availability, please fill out this form. We’ll reach out to let you know when it’s available for your use.


Simplify operations & management

GCP’s serverless approach removes operational overhead with performance, scaling, availability, security and compliance handled automatically so users can focus on programming instead of managing server clusters. Integration with Stackdriver, GCP’s unified logging and monitoring solution, lets you monitor and troubleshoot your pipelines as they are running. Rich visualization, logging, and advanced alerting help you identify and respond to potential issues.


Build on a foundation for machine learning

Use Cloud Dataflow as a convenient integration point to bring predictive analytics to fraud detection, real-time personalization and more through Google Cloud’s AI Platform and TensorFlow Extended (TFX). TFX uses Cloud Dataflow and Apache Beam as the distributed data processing engine to realize several aspects of the ML life cycle.


Use your favorite and familiar tools

Cloud Dataflow seamlessly integrates with GCP services for streaming event ingestion (Cloud Pub/Sub), data warehousing (BigQuery), machine learning (Cloud AI Platform), and more. Its Beam-based SDK also lets developers build custom extensions and even choose alternative execution engines, such as Apache Spark. For Apache Kafka users, a Cloud Dataflow connector makes integration with GCP easy.


Data Transformation with Cloud Dataflow



Automated Resource Management
Cloud Dataflow automates provisioning and management of processing resources to minimize latency and maximize utilization; no more spinning up instances by hand or reserving them.
Dynamic Work Rebalancing
Automated and optimized work partitioning dynamically rebalances lagging work. No need to chase down “hot keys” or pre-process your input data.
Reliable & Consistent Exactly-once Processing
Provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern or pipeline complexity.
Horizontal Auto-scaling
Horizontal auto-scaling of worker resources for optimum throughput results in better overall price-to-performance.
Unified Programming Model
The Apache Beam SDK offers equally rich MapReduce-like operations, powerful data windowing, and fine-grained correctness control for streaming and batch data alike.
Community-driven Innovation
Developers wishing to extend the Cloud Dataflow programming model can fork and/or contribute to Apache Beam.
Flexible Resource Scheduling Pricing for Batch Processing
For batch jobs whose start time is flexible, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.
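As a rough illustration of the unified programming model listed among the features above, the following plain-Python sketch applies one fixed-window counting function to both a stored list (batch) and a generator that stands in for a stream. It does not use the Apache Beam SDK. The `windowed_counts` function, the `simulated_stream` helper, and the 60-second window are assumptions made for this example, and the sketch assumes timestamps arrive in order, which real streaming systems handle with additional correctness controls.

```python
from typing import Iterable, Iterator, Tuple


def windowed_counts(timestamps: Iterable[float], window_seconds: float) -> Iterator[Tuple[int, int]]:
    """Yield (window_index, event_count) each time a fixed window closes.

    Assumes timestamps arrive in non-decreasing order.
    """
    current_window = None
    count = 0
    for ts in timestamps:
        window = int(ts // window_seconds)
        # Seeing an event from a different window closes the current one.
        if current_window is not None and window != current_window:
            yield current_window, count
            count = 0
        current_window = window
        count += 1
    # Input is exhausted: emit the last open window, if any.
    if current_window is not None:
        yield current_window, count


def simulated_stream() -> Iterator[float]:
    """Hypothetical stand-in for an event source that yields timestamps one by one."""
    for ts in (1.0, 2.5, 61.0, 62.0, 125.0):
        yield ts


batch_result = list(windowed_counts([1.0, 2.5, 61.0, 62.0, 125.0], 60.0))
stream_result = list(windowed_counts(simulated_stream(), 60.0))
```

Both calls run the same logic and produce the same per-window counts. The only difference is whether the input is already complete or is produced incrementally.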

Cloud Dataflow vs. Cloud Dataproc: Which should you use?

Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. How do you decide which product is a better fit for your environment?

Cloud Dataproc

Cloud Dataproc is good for environments dependent on specific components of the Apache big data ecosystem:

  • Tools/packages
  • Pipelines
  • Skill sets of existing resources

Cloud Dataflow

Cloud Dataflow is typically the preferred option for greenfield environments:

  • Less operational overhead
  • Unified approach to development of batch or streaming pipelines
  • Uses Apache Beam
  • Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes

Recommended Workloads

  • Stream processing (ETL)
  • Batch processing (ETL)
  • Iterative processing and notebooks
  • Machine learning with Spark ML
  • Machine learning with Cloud AI Platform and TensorFlow Extended (TFX)

Partnerships & Integrations

Google Cloud Platform partners and third-party developers have built integrations with Dataflow that quickly and easily enable powerful data processing tasks of any size.




Salesforce




“Running our pipelines on Cloud Dataflow lets us focus on programming without having to worry about deploying and maintaining instances running our code (a hallmark of GCP overall).”

- Jibran Saithi, Lead Architect, Qubit

User-friendly Pricing

Cloud Dataflow jobs are billed in per-second increments, based on the actual use of Cloud Dataflow batch or streaming workers. Jobs that consume additional GCP resources, such as Cloud Storage or Cloud Pub/Sub, are each billed according to that service’s pricing.
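The arithmetic behind per-second billing can be sketched as follows. This is an illustration only: it assumes charges accrue per vCPU, per GB of memory, and per GB of Persistent Disk for each second a worker runs, and every rate in `PLACEHOLDER_RATES` is a made-up number, not an actual Cloud Dataflow price. The `Rates` class and `estimate_job_cost` function are hypothetical names. The worker shape in the usage example matches the batch worker defaults listed in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hourly rates per resource unit. All values used below are placeholders."""
    per_vcpu_hour: float
    per_gb_memory_hour: float
    per_gb_disk_hour: float


# Placeholder numbers chosen for easy arithmetic, NOT real prices.
PLACEHOLDER_RATES = Rates(per_vcpu_hour=0.06, per_gb_memory_hour=0.004, per_gb_disk_hour=0.0001)


def estimate_job_cost(workers: int, seconds: int, vcpus: float,
                      memory_gb: float, disk_gb: float, rates: Rates) -> float:
    """Estimate worker cost for a job billed by the second."""
    hours = seconds / 3600  # per-second usage expressed as a fraction of an hour
    per_worker_hour = (vcpus * rates.per_vcpu_hour
                       + memory_gb * rates.per_gb_memory_hour
                       + disk_gb * rates.per_gb_disk_hour)
    return workers * hours * per_worker_hour


# Ten batch workers with the default shape (1 vCPU, 3.75 GB, 250 GB disk) for 30 minutes.
cost = estimate_job_cost(workers=10, seconds=1800, vcpus=1, memory_gb=3.75,
                         disk_gb=250, rates=PLACEHOLDER_RATES)
```

With these placeholder rates each worker costs about 0.10 per hour, so ten workers running for half an hour come to roughly 0.50. Charges from other services the job uses, such as Cloud Storage or Cloud Pub/Sub, are not part of this estimate.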

1 Batch worker defaults: 1 vCPU, 3.75 GB memory, 250 GB Persistent Disk

2 FlexRS worker defaults: 2 vCPU, 7.50 GB memory, 25 GB Persistent Disk per worker, with a minimum of two workers

3 Streaming worker defaults: 4 vCPU, 15 GB memory, 420 GB Persistent Disk

4 Dataflow Shuffle is currently available for batch pipelines in the following regions:

  • us-central1 (Iowa)
  • us-east1 (South Carolina)
  • us-west1 (Oregon)
  • europe-west1 (Belgium)
  • europe-west4 (Netherlands)
  • asia-east1 (Taiwan)
  • asia-northeast1 (Tokyo)

It will become available in other regions in the future.

5 Dataflow Streaming Engine uses the Streaming Data Processed pricing unit. Streaming Engine is currently available in the following regions:

  • us-central1 (Iowa)
  • us-east1 (South Carolina)
  • us-west1 (Oregon)
  • europe-west1 (Belgium)
  • europe-west4 (Netherlands)
  • asia-east1 (Taiwan)
  • asia-northeast1 (Tokyo)

It will become available in other regions in the future.

6 See Cloud Dataflow Pricing for more information about Data Processed.

Cloud AI products comply with the SLA policies listed here. They may offer different latency or availability guarantees from other Google Cloud services.
